[ 
https://issues.apache.org/jira/browse/PARQUET-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281542#comment-15281542
 ] 

Jeffrey Olchovy commented on PARQUET-171:
-----------------------------------------

Hey Jerome, for reference/historical purposes, you can find the patch here: 
https://github.com/apache/parquet-mr/pull/107.

Please note that this PR was ultimately closed and left unmerged because the 
underlying issue was resolved via parallel development in the upstream 
repository (https://issues.apache.org/jira/browse/PARQUET-139). If you use 
version 1.6.0 or greater, Avro schema evolution/resolution should work 
appropriately.

> AvroReadSupport does not support Avro schema resolution
> -------------------------------------------------------
>
>                 Key: PARQUET-171
>                 URL: https://issues.apache.org/jira/browse/PARQUET-171
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>            Reporter: Jeffrey Olchovy
>
> Given multiple "different-yet-compatible" Avro-backed Parquet files, a 
> runtime exception will be encountered when trying to merge the metadata 
> values across the files if they are used as input sources for a MapReduce job.
> A contrived example of this problem is provided, along with a derived version 
> of {{AvroReadSupport}} that can correctly handle valid schema 
> resolution/evolution scenarios.
> *Illustration of Problem*
> A simple Avro schema exists, which contains a single record type that 
> consists of a required String member.
> {noformat}
> {"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":{"type":"string","avro.java.string":"String"}}]}
> {noformat}
> When stored as Parquet-Avro the resulting schema is:
> {noformat}
> message com.tapad.avro.Foo {
>   required binary my_field (UTF8);
> }
> {noformat}
> Data is written to a Parquet-Avro file with the following contents:
> {noformat}
> my_field = aaa
> my_field = bbb
> {noformat}
> The schema for the Foo record is then changed so that its String member is 
> made optional, with a default value of null now provided for the String 
> member.
> {noformat}
> {"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":["null",{"type":"string","avro.java.string":"String"}],"default":null}]}
> message com.tapad.avro.Foo {
>   optional binary my_field (UTF8);
> }
> {noformat}
> This change adheres to the Avro Schema Resolution rules as found in 
> http://avro.apache.org/docs/current/spec.html#Schema+Resolution.
> Data is then written to a new Parquet-Avro file.
> {noformat}
> my_field = ccc
> {noformat}
> When both Parquet-Avro files are used as input to a MapReduce job, wherein 
> the schemas in the data files are considered to be the "writer" schemas and 
> the schema on our job's classpath -- in this case, the updated schema -- is 
> used as the "reader" schema, the following {{RuntimeException}} is 
> encountered:
> {noformat}
> Caused by: java.lang.RuntimeException: could not merge metadata: key 
> avro.schema has conflicting values: 
> [{"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":{"type":"string","avro.java.string":"String"}}]},
>  {"type":"record","name":"Foo","namespace":"com.tapad.avro","fields":[{"name":"my_field","type":["null",{"type":"string","avro.java.string":"String"}],"default":null}]}]
>         at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67)
>         at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84)
>         at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:263)
> ...
> {noformat}
> *Solution*
> Each schema in every data file (the "writer" schemas) should be checked for 
> compatibility with the "reader" schema. If all "writer" schemas are 
> compatible with the "reader" schema, all records in all data files can be 
> migrated to the "reader" schema.
> The Apache Avro library provides utilities for performing such compatibility 
> checks across schemas. Provided is a derived version of {{AvroReadSupport}} 
> that uses these utilities to successfully process the records in the 
> aforementioned data files when both are used as input to a MapReduce job.
> -_NOTE: Solution will be provided as a hyperlink to a Github Pull Request_- 
> https://github.com/apache/incubator-parquet-mr/pull/107
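The compatibility argument in the report above can be sketched with a minimal, 
illustrative check. This is NOT the parquet-mr or Avro implementation (Avro's 
actual utility is Java's org.apache.avro.SchemaCompatibility); it is a 
simplified stand-in, and the plain "string" type here abbreviates the 
{"type":"string","avro.java.string":"String"} form used in the report:

```python
import json

# Writer schema: "my_field" is a required string (old data files).
writer = json.loads(
    '{"type":"record","name":"Foo","namespace":"com.tapad.avro",'
    '"fields":[{"name":"my_field","type":"string"}]}'
)

# Reader schema: "my_field" is now optional (union with null, default null).
reader = json.loads(
    '{"type":"record","name":"Foo","namespace":"com.tapad.avro",'
    '"fields":[{"name":"my_field","type":["null","string"],"default":null}]}'
)

def field_types(schema):
    """Map field name -> declared type for a record schema."""
    return {f["name"]: f["type"] for f in schema["fields"]}

def compatible(reader, writer):
    """Simplified sketch of the Avro schema-resolution rules exercised here:
    every reader field must either match the writer's type (or accept it as a
    union branch), or, if absent from the writer, carry a default value."""
    wf = field_types(writer)
    for name, rtype in field_types(reader).items():
        if name in wf:
            wtype = wf[name]
            if isinstance(rtype, list):
                # Writer's type must resolve into one of the reader's
                # union branches.
                if wtype not in rtype and wtype != rtype:
                    return False
            elif wtype != rtype:
                return False
        else:
            # Reader-only fields must carry a default value.
            field = next(f for f in reader["fields"] if f["name"] == name)
            if "default" not in field:
                return False
    return True

print(compatible(reader, writer))  # prints: True
```

Under this rule the evolved schema can read records written with the original 
one, which is exactly why merging the two files' metadata should succeed 
rather than raise the {{RuntimeException}} above.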



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
