[ https://issues.apache.org/jira/browse/PARQUET-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281417#comment-15281417 ]
Jerome BAROTIN commented on PARQUET-171:
----------------------------------------
Hi, I am very interested in viewing the patch, but since the hosting of the
parquet-mr project changed, the URL is broken.
Is it still possible to access this pull request?
> AvroReadSupport does not support Avro schema resolution
> -------------------------------------------------------
>
> Key: PARQUET-171
> URL: https://issues.apache.org/jira/browse/PARQUET-171
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Reporter: Jeffrey Olchovy
>
> Given multiple "different-yet-compatible" Avro-backed Parquet files, a
> runtime exception will be encountered when trying to merge the metadata
> values across the files if they are used as input sources for a MapReduce job.
> A contrived example of this problem is provided, along with a derived version
> of {{AvroReadSupport}} that can correctly handle valid schema
> resolution/evolution scenarios.
> *Illustration of Problem*
> A simple Avro schema exists, which contains a single record type that
> consists of a required String member.
> {noformat}
> {"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":{"type":"string","avro.java.string":"String"}}]}
> {noformat}
> When stored as Parquet-Avro the resulting schema is:
> {noformat}
> message com.tapad.avro.Foo {
> required binary my_field (UTF8);
> }
> {noformat}
> Data is written to a Parquet-Avro file with the following contents:
> {noformat}
> my_field = aaa
> my_field = bbb
> {noformat}
> The schema for the Foo record is then changed so that its String member is
> made optional, with a default value of null now provided for the String
> member.
> {noformat}
> {"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":["null",{"type":"string","avro.java.string":"String"}],"default":null}]}
> message com.tapad.avro.Foo {
> optional binary my_field (UTF8);
> }
> {noformat}
> This change adheres to the Avro Schema Resolution rules as found in
> http://avro.apache.org/docs/current/spec.html#Schema+Resolution.
> Data is then written to a new Parquet-Avro file.
> {noformat}
> my_field = ccc
> {noformat}
> When both Parquet-Avro files are used as input to a MapReduce job, wherein
> the schemas in the data files are considered to be the "writer" schemas and
> the schema on our job's classpath -- in this case, the updated schema -- is
> used as the "reader" schema, the following {{RuntimeException}} is
> encountered:
> {noformat}
> Caused by: java.lang.RuntimeException: could not merge metadata: key avro.schema has conflicting values:
> [{"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":{"type":"string","avro.java.string":"String"}}]},
> {"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":["null",{"type":"string","avro.java.string":"String"}],"default":null}]}]
>     at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67)
>     at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84)
>     at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:263)
> ...
> {noformat}
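To make the failure mode above concrete, here is a minimal sketch of a strict key/value metadata merge that rejects any key carrying more than one distinct value. It is a simplified stand-in written for illustration, not the actual parquet-mr code: the class name `StrictMetadataMerge` and the method `merge` are hypothetical, and only the error message wording is taken from the stack trace above.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of a strict metadata merge (not parquet-mr source).
class StrictMetadataMerge {

    // Collects every distinct value seen per key across all files,
    // then requires that each key ended up with exactly one value.
    static Map<String, String> merge(List<Map<String, String>> perFileMetadata) {
        Map<String, Set<String>> seen = new LinkedHashMap<>();
        for (Map<String, String> fileMeta : perFileMetadata) {
            for (Map.Entry<String, String> entry : fileMeta.entrySet()) {
                seen.computeIfAbsent(entry.getKey(), k -> new LinkedHashSet<>())
                    .add(entry.getValue());
            }
        }

        Map<String, String> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : seen.entrySet()) {
            if (entry.getValue().size() > 1) {
                // Two files with different avro.schema strings end up here,
                // even when the schemas are compatible under Avro's rules.
                throw new RuntimeException("could not merge metadata: key "
                        + entry.getKey() + " has conflicting values: " + entry.getValue());
            }
            merged.put(entry.getKey(), entry.getValue().iterator().next());
        }
        return merged;
    }
}
```

Under this model the two files in the example fail purely because their `avro.schema` strings differ textually; no schema resolution is attempted before the merge.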
> *Solution*
> The schema in every data file (the "writer" schemas) should be checked for
> compatibility with the "reader" schema. If all "writer" schemas are
> compatible with the "reader" schema, all records in all data files can be
> migrated to the "reader" schema.
> The Apache Avro library provides utilities for performing compatibility
> checks across schemas. A derived version of {{AvroReadSupport}} is provided
> that uses these utilities to successfully process the records in the
> aforementioned data files when they are both used as input to a MapReduce
> job.
> -_NOTE: Solution will be provided as a hyperlink to a Github Pull Request_-
> https://github.com/apache/incubator-parquet-mr/pull/107
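Since the pull request link is no longer reachable, the following is only a rough sketch of the check the Solution section describes: every "writer" schema is tested against the single "reader" schema, and any writer the reader cannot read is reported. The class `WriterSchemaCheck`, the method `incompatibleWriters`, and the `canRead` hook are all hypothetical names. In a real job the hook would presumably wrap Avro's own compatibility utilities (for example `SchemaCompatibility.checkReaderWriterCompatibility` in the Avro Java library); whether the original patch did exactly that is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch of the "check every writer against the reader" step.
class WriterSchemaCheck {

    // Returns the writer schemas that the reader schema cannot read.
    // canRead is a pluggable (reader, writer) -> boolean hook; it is an
    // assumption of this sketch, standing in for Avro's compatibility check.
    static List<String> incompatibleWriters(String readerSchema,
                                            List<String> writerSchemas,
                                            BiPredicate<String, String> canRead) {
        List<String> rejected = new ArrayList<>();
        for (String writerSchema : writerSchemas) {
            if (!canRead.test(readerSchema, writerSchema)) {
                rejected.add(writerSchema);
            }
        }
        return rejected;
    }
}
```

An empty result would mean all input files can be read with the reader schema, so the job could proceed with that schema instead of failing on the textual mismatch.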
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)