Thanks for the update, Ryan. Yes, I found this info in https://issues.apache.org/jira/browse/PARQUET-139, which avoids merging the schemas on the client side.
And for schema merging, is there a plan to define rules for merging schemas, such as merging an "int" and a "long" into a "long" field? I ask because we have some Parquet files written with different schemas, for historical reasons. Allowing this type of merging would help a lot when we process the data. Besides MapReduce applications, we also run into the schema problem when using Hive and Spark SQL to load the data.

-Wei

On Mon, May 11, 2015 at 10:45 AM, Ryan Blue <[email protected]> wrote:

> To follow up, I think the problem here was that we were merging two
> Parquet schemas. We don't really have rules for merging schemas, and we
> don't really need them. 1.6.0 works because we resolve the expected
> schema against each file schema individually.
>
> This will still be a problem if you use client-side metadata instead of
> task-side.
>
> rb
>
> On 05/06/2015 08:43 PM, Alex Levenson wrote:
>
>> Glad that worked!
>>
>> On Wed, May 6, 2015 at 6:42 PM, Wei Yan <[email protected]> wrote:
>>
>>> Thanks, Alex. The new version solves the issue.
>>>
>>> -Wei
>>>
>>> On Tue, May 5, 2015 at 8:20 PM, Alex Levenson <[email protected]> wrote:
>>>
>>>> 1.6.0rc1 is pretty old; have you tried 1.6.0?
>>>>
>>>> On Tue, May 5, 2015 at 9:31 AM, Wei Yan <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've run into a problem using AvroParquetInputFormat in my MapReduce
>>>>> job. The input files were written with two different versions of the
>>>>> schema: one field is an "int" in v1 and a "long" in v2.
>>>>> The exception:
>>>>>
>>>>> Exception in thread "main"
>>>>> parquet.schema.IncompatibleSchemaModificationException: can not merge
>>>>> type optional int32 a into optional int64 a
>>>>>     at parquet.schema.PrimitiveType.union(PrimitiveType.java:513)
>>>>>     at parquet.schema.GroupType.mergeFields(GroupType.java:359)
>>>>>     at parquet.schema.GroupType.union(GroupType.java:341)
>>>>>     at parquet.schema.GroupType.mergeFields(GroupType.java:359)
>>>>>     at parquet.schema.MessageType.union(MessageType.java:138)
>>>>>     at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:497)
>>>>>     at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:470)
>>>>>     at parquet.hadoop.ParquetFileWriter.getGlobalMetaData(ParquetFileWriter.java:446)
>>>>>     at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:429)
>>>>>     at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:412)
>>>>>     at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:589)
>>>>>
>>>>> I'm using Parquet 1.5, and it looks like "int" cannot be merged with
>>>>> "long". I tried 1.6.0rc1 and set "parquet.strict.typing", but that
>>>>> still doesn't help.
>>>>>
>>>>> So I want to ask: is there any way to solve this problem, such as
>>>>> automatically converting "int" to "long", instead of re-writing all
>>>>> the data with the same schema version?
>>>>>
>>>>> thanks,
>>>>> Wei
>>>>
>>>> --
>>>> Alex Levenson
>>>> @THISWILLWORK
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
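[Editor's note] For readers finding this thread in the archives: a minimal sketch of the two points discussed above, assuming a standard MapReduce Job setup on parquet-mr 1.6.x (pre-rename "parquet.*" packages, matching the stack trace). The record and field names here are illustrative, not from the original files.

```java
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import parquet.avro.AvroParquetInputFormat;

public class SchemaResolutionSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());

    // 1. Use task-side metadata (PARQUET-139) so split planning reads each
    //    file's footer in the tasks instead of union-merging all footers
    //    on the client, which is where the IncompatibleSchemaModificationException
    //    above is thrown.
    job.getConfiguration().setBoolean("parquet.task.side.metadata", true);

    // 2. Give parquet-avro an explicit read schema that declares the field
    //    as "long". Avro schema resolution promotes int to long, so files
    //    written with either schema version resolve against this one.
    Schema readSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Record\",\"fields\":"
      + "[{\"name\":\"a\",\"type\":\"long\"}]}");
    job.setInputFormatClass(AvroParquetInputFormat.class);
    AvroParquetInputFormat.setAvroReadSchema(job, readSchema);
  }
}
```

With this setup each file's schema is resolved against the read schema individually, so the cross-file merge that fails in 1.5 is never attempted.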
