[ https://issues.apache.org/jira/browse/PARQUET-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116413#comment-15116413 ]
Ryan Blue commented on PARQUET-465:
-----------------------------------
[~eggsby], sorry for the confusion here. Right now, parquet-avro handles
read-side schemas in a strange way: there are two of them to worry about. The
one you're setting is the read schema, the schema you expect records to conform
to, which lets you supply defaults for missing fields. The other is the
projection schema, which is what actually drives column selection. So you could
filter out foo with the projection schema and still have a default value for it
added to your records through the read schema. I think these should be the same
thing, with the projection derived from the expected schema... I just haven't
had time to build it yet (patches welcome!).
So here's what you should do to fix it: set the projection schema to your
reader schema using {{AvroParquetInputFormat.setRequestedProjection}}. That
should prevent the code from expecting foo.
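Roughly, and untested against your job, the setup would look like the sketch
below. The class name {{ReadFooBarV2}} and the inline JSON form of your v2
schema are just placeholders I made up for illustration; the two
{{AvroParquetInputFormat}} calls and the compatibility flag are the ones
already mentioned in this issue.
{code}
import org.apache.avro.Schema;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.avro.AvroParquetInputFormat;
import org.apache.parquet.avro.AvroReadSupport;

// Hypothetical helper class, for illustration only.
public class ReadFooBarV2 {
  // v2 of FooBar from the description: foo removed, only bar remains.
  static final Schema READER_SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"FooBar\",\"fields\":["
      + "{\"name\":\"bar\",\"type\":\"string\"}]}");

  static void configure(Job job) {
    // Same settings as in the issue description.
    job.getConfiguration().setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false);
    AvroParquetInputFormat.setAvroReadSchema(job, READER_SCHEMA);
    // The missing piece: also use the reader schema for column selection,
    // so that foo is never requested from the file.
    AvroParquetInputFormat.setRequestedProjection(job, READER_SCHEMA);
  }
}
{code}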
> Parquet-Avro does not support field removal
> -------------------------------------------
>
> Key: PARQUET-465
> URL: https://issues.apache.org/jira/browse/PARQUET-465
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.8.0
> Reporter: Thomas Omans
>
> Parquet-avro does not support removal of fields when used with the new
> compatibility layer.
> Given a Parquet file written with parquet-avro using schema version v1:
> {code}
> record FooBar {
>   long foo;
>   string bar;
> }
> {code}
> And the following configuration settings:
> {code}
> job.getConfiguration.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false)
> AvroParquetInputFormat.setAvroReadSchema(job, avroReaderSchema)
> {code}
> A job fails when trying to read it using schema version v2:
> {code}
> record FooBar {
>   string bar;
> }
> {code}
> With the error:
> {code}
> org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'foo' not found
>   at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:159)
> {code}
> It looks like the reader sees the field in the original version and assumes
> the new version must expect it too, but in this case the field was simply
> removed. Avro schema resolution dictates that the reader just ignores this
> field, since it is not relevant in the new version.