[jira] [Commented] (PARQUET-465) Parquet-Avro does not support field removal

Ryan Blue (JIRA) Mon, 25 Jan 2016 16:56:17 -0800

    [ 
https://issues.apache.org/jira/browse/PARQUET-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116413#comment-15116413
 ]


Ryan Blue commented on PARQUET-465:
-----------------------------------

[~eggsby], sorry for the confusion here. Right now, parquet-avro handles this in 
a strange way: there are two read-side schemas you need to worry about. The one 
you're setting is the read schema, which is the schema you expect records to 
conform to; it lets you set defaults for missing fields. Then there is also the 
projection schema, which is used to do the actual column selection. So you 
could filter out foo with the projection schema and still have a default value 
for it added to your records by the read schema. I think these should be the 
same thing, and we should derive the projection from the expected schema. I 
just haven't had time to build it yet (patches welcome!).

So here's what you should do to fix it: set the projection schema to your 
reader schema using {{AvroParquetInputFormat.setRequestedProjection}}. That 
should prevent the code from expecting foo.
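
A minimal sketch of that workaround in Scala (the language of the snippet in the report), assuming parquet-avro 1.8.0 and an existing Hadoop {{Job}}. The names {{FieldRemovalWorkaround}} and {{configure}} are made up for illustration, and the schema JSON is a hand-written equivalent of the v2 record from the report:

```scala
import org.apache.avro.Schema
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.{AvroParquetInputFormat, AvroReadSupport}

// Hypothetical helper object; not part of parquet-avro.
object FieldRemovalWorkaround {
  // v2 reader schema from the report: 'foo' has been removed, only 'bar' remains.
  val avroReaderSchema: Schema = new Schema.Parser().parse(
    """{"type": "record", "name": "FooBar",
      | "fields": [{"name": "bar", "type": "string"}]}""".stripMargin)

  def configure(job: Job): Unit = {
    // Same settings as in the report.
    job.getConfiguration.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false)
    AvroParquetInputFormat.setAvroReadSchema(job, avroReaderSchema)
    // The workaround: use the same schema for column projection, so that
    // 'foo' is not selected from the file in the first place.
    AvroParquetInputFormat.setRequestedProjection(job, avroReaderSchema)
  }
}
```

Passing the same schema to both calls is what keeps the read schema and the projection in agreement until the projection is derived automatically.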

> Parquet-Avro does not support field removal
> -------------------------------------------
>
>                 Key: PARQUET-465
>                 URL: https://issues.apache.org/jira/browse/PARQUET-465
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.8.0
>            Reporter: Thomas Omans
>
> Parquet-avro does not support removing fields when used with the new 
> compatibility layer.
> Given a parquet file written with parquet avro at v1 and the following schema:
> {code}
> record FooBar {
>   long foo;
>   string bar;
> }
> {code}
> And the following configuration settings:
> {code}
> job.getConfiguration.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false)
> AvroParquetInputFormat.setAvroReadSchema(job, avroReaderSchema)
> {code}
> A job fails when trying to read it using schema version v2:
> {code}
> record FooBar {
>   string bar;
> }
> {code}
> With the error:
> {code}
> org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'foo' not found
>       at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:159)
> {code}
> It looks like the reader sees the field in the original version and assumes 
> the new version must also expect it, but in this case the field was simply 
> removed. Avro schema resolution dictates that the reader ignore this field, 
> since it is not present in the new version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
