[ https://issues.apache.org/jira/browse/PARQUET-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116441#comment-15116441 ]
Thomas Omans commented on PARQUET-465:
--------------------------------------
Ryan,
Thank you for the quick response.
Unfortunately, adding
`AvroParquetInputFormat.setRequestedProjection(job, avroReaderSchema)` breaks
other things. Specifically, you lose the ability to rename fields (lookups
silently fail and put `null` into the field value) and to apply default values.
I actually created a series of evolving schemas in order to see what was
supported and what was not:
{code}
@namespace("com.example.avro.compatibility")
protocol Compatibility {
  // original record, only one field
  @namespace("com.example.avro.compatibility.v1")
  record CompatibilityTestRecord {
    long time;
  }

  // add new field with default value
  @namespace("com.example.avro.compatibility.v2")
  record CompatibilityTestRecord {
    long time;
    string id = "v2";
  }

  // reorder fields
  @namespace("com.example.avro.compatibility.v3")
  record CompatibilityTestRecord {
    string id = "v3";
    long time;
  }

  // alias field
  @namespace("com.example.avro.compatibility.v4")
  record CompatibilityTestRecord {
    string @aliases(["id"]) notId = "v4";
    long @aliases(["time"]) notTime;
  }

  // drop field
  @namespace("com.example.avro.compatibility.v5")
  record CompatibilityTestRecord {
    string id = "v5";
  }
}
{code}
I wrote five Parquet files, each containing one record written at a specific
schema version, then tried to read them back with every later version: v2
reading v1, v3 reading v2 and v1, and so on.
Setting only the read schema (`setAvroReadSchema`) to the latest version makes
every case pass except the last one, dropping a field. Setting both the read
schema and the requested projection makes all of them fail, because defaults
are not applied and aliases are not respected.
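For reference, the behaviour I expected in each case follows the record rules of Avro schema resolution: reader fields are matched by name regardless of order, then by reader-side alias, then filled from the reader's default, and fields that exist only in the written data are ignored. The following is only a minimal sketch of those rules in plain Java over maps. `ResolutionSketch`, `ReaderField` and `resolve` are hypothetical names made up for illustration. They are not Avro or Parquet classes, and this is not how parquet-avro implements its converters.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of Avro record resolution.
// All names here are hypothetical; none of this is Avro or Parquet API.
final class ResolutionSketch {

    /** Stand-in for one field of the reader schema. */
    static final class ReaderField {
        final String name;
        final List<String> aliases;
        final boolean hasDefault;
        final Object defaultValue;

        private ReaderField(String name, boolean hasDefault, Object defaultValue, String... aliases) {
            this.name = name;
            this.hasDefault = hasDefault;
            this.defaultValue = defaultValue;
            this.aliases = List.of(aliases);
        }

        static ReaderField required(String name, String... aliases) {
            return new ReaderField(name, false, null, aliases);
        }

        static ReaderField withDefault(String name, Object defaultValue, String... aliases) {
            return new ReaderField(name, true, defaultValue, aliases);
        }
    }

    /** Builds the record the reader sees from the fields that were written. */
    static Map<String, Object> resolve(Map<String, Object> written, List<ReaderField> readerFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (ReaderField field : readerFields) {
            // 1. Match by name; field order in the file is irrelevant.
            String source = written.containsKey(field.name) ? field.name : null;
            // 2. Otherwise match a written field through one of the reader's aliases.
            if (source == null) {
                for (String alias : field.aliases) {
                    if (written.containsKey(alias)) {
                        source = alias;
                        break;
                    }
                }
            }
            if (source != null) {
                result.put(field.name, written.get(source));
            } else if (field.hasDefault) {
                // 3. Field absent from the file: use the reader's default.
                result.put(field.name, field.defaultValue);
            } else {
                throw new IllegalStateException("No value or default for reader field '" + field.name + "'");
            }
        }
        // 4. Written fields the reader never asked for are simply not copied.
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> v2File = new LinkedHashMap<>();
        v2File.put("time", 1L);
        v2File.put("id", "v2");

        // v4 reading v2: both fields renamed through aliases.
        System.out.println(resolve(v2File, List.of(
                ReaderField.withDefault("notId", "v4", "id"),
                ReaderField.required("notTime", "time"))));

        // v5 reading v2: "time" was dropped and should be ignored.
        System.out.println(resolve(v2File, List.of(ReaderField.withDefault("id", "v5"))));
    }
}
```

Under this model, the v5-reading-v2 case returns a record with only `id`, which is the outcome I would expect from the real reader instead of the `InvalidRecordException`.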
Thanks again, what is there is great -- I have been reading your commits all
day :)
> Parquet-Avro does not support field removal
> -------------------------------------------
>
> Key: PARQUET-465
> URL: https://issues.apache.org/jira/browse/PARQUET-465
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.8.0
> Reporter: Thomas Omans
>
> Parquet-Avro does not support removal of fields when used with the new
> compatibility layer.
> Given a Parquet file written with parquet-avro using the following v1 schema:
> {code}
> record FooBar {
>   long foo;
>   string bar;
> }
> {code}
> And the following configuration settings:
> {code}
> job.getConfiguration.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false)
> AvroParquetInputFormat.setAvroReadSchema(job, avroReaderSchema)
> {code}
> A job fails when trying to read it using schema version v2:
> {code}
> record FooBar {
>   string bar;
> }
> {code}
> With the error:
> {code}
> org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'foo' not found
>     at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:159)
> {code}
> It looks like the converter sees the field in the original (written) schema
> and assumes the new version must also contain it, but in this case the field
> was simply removed. Avro schema resolution dictates that such a field is
> ignored, since it is not relevant in the new version.