Thanks for the update, Ryan. Yes, I found this info in https://issues.apache.org/jira/browse/PARQUET-139, which avoids merging the schemas on the client side.
And for schema merging, is there a plan to define rules for merging schemas, such as merging an "int" and a "long" into a "long" field? I ask because we have some Parquet files written with different schemas, for historical reasons. Allowing this type of merging would help a lot when we process the data. Besides MapReduce applications, we also run into the schema problem when using Hive and Spark SQL to load the data.

-Wei

On Mon, May 11, 2015 at 10:45 AM, Ryan Blue <[email protected]> wrote:

> To follow up, I think the problem here was that we were merging two
> Parquet schemas. We don't really have rules for merging schemas, and we
> don't really need them. 1.6.0 works because we resolve the expected
> schema against each file schema individually.
>
> This will still be a problem if you use client-side metadata instead of
> task-side.
>
> rb
>
> On 05/06/2015 08:43 PM, Alex Levenson wrote:
>
>> Glad that worked!
>>
>> On Wed, May 6, 2015 at 6:42 PM, Wei Yan <[email protected]> wrote:
>>
>>> Thanks, Alex. The new version solves the issue.
>>>
>>> -Wei
>>>
>>> On Tue, May 5, 2015 at 8:20 PM, Alex Levenson <[email protected]> wrote:
>>>
>>>> 1.6.0rc1 is pretty old; have you tried 1.6.0?
>>>>
>>>> On Tue, May 5, 2015 at 9:31 AM, Wei Yan <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've run into a problem using AvroParquetInputFormat in my MapReduce
>>>>> job. The input files were written with two different versions of the
>>>>> schema: one field is an "int" in v1 and a "long" in v2.
>>>>> The exception:
>>>>>
>>>>> Exception in thread "main"
>>>>> parquet.schema.IncompatibleSchemaModificationException: can not merge
>>>>> type optional int32 a into optional int64 a
>>>>>     at parquet.schema.PrimitiveType.union(PrimitiveType.java:513)
>>>>>     at parquet.schema.GroupType.mergeFields(GroupType.java:359)
>>>>>     at parquet.schema.GroupType.union(GroupType.java:341)
>>>>>     at parquet.schema.GroupType.mergeFields(GroupType.java:359)
>>>>>     at parquet.schema.MessageType.union(MessageType.java:138)
>>>>>     at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:497)
>>>>>     at parquet.hadoop.ParquetFileWriter.mergeInto(ParquetFileWriter.java:470)
>>>>>     at parquet.hadoop.ParquetFileWriter.getGlobalMetaData(ParquetFileWriter.java:446)
>>>>>     at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:429)
>>>>>     at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:412)
>>>>>     at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:589)
>>>>>
>>>>> I'm using Parquet 1.5, and it looks like "int" cannot be merged with
>>>>> "long". I tried 1.6.0rc1 and set "parquet.strict.typing", but that
>>>>> still doesn't help.
>>>>>
>>>>> So I want to ask: is there any way to solve this problem, such as
>>>>> automatically converting "int" to "long", instead of re-writing all
>>>>> the data with the same schema version?
>>>>>
>>>>> thanks,
>>>>> Wei
>>>>
>>>> --
>>>> Alex Levenson
>>>> @THISWILLWORK
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
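[Editor's note] For readers finding this thread in the archives: a minimal sketch of the two points discussed above, assuming a standard MapReduce Job setup on parquet-mr 1.6.x (pre-rename "parquet.*" packages, matching the stack trace). The record and field names here are illustrative, not from the original files.

```java
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import parquet.avro.AvroParquetInputFormat;

public class SchemaResolutionSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());

    // 1. Use task-side metadata (PARQUET-139) so split planning reads each
    //    file's footer in the tasks instead of union-merging all footers
    //    on the client, which is where the IncompatibleSchemaModificationException
    //    above is thrown.
    job.getConfiguration().setBoolean("parquet.task.side.metadata", true);

    // 2. Give parquet-avro an explicit read schema that declares the field
    //    as "long". Avro schema resolution promotes int to long, so files
    //    written with either schema version resolve against this one.
    Schema readSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Record\",\"fields\":"
      + "[{\"name\":\"a\",\"type\":\"long\"}]}");
    job.setInputFormatClass(AvroParquetInputFormat.class);
    AvroParquetInputFormat.setAvroReadSchema(job, readSchema);
  }
}
```

With this setup each file's schema is resolved against the read schema individually, so the cross-file merge that fails in 1.5 is never attempted.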
