[
https://issues.apache.org/jira/browse/PARQUET-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118075#comment-15118075
]
Ryan Blue edited comment on PARQUET-465 at 1/26/16 9:29 PM:
------------------------------------------------------------
[~eggsby], thanks for the thoroughness! Have you tried using different schemas
for the projection and read?
It looks like the read schema can handle renames, but the projection schema
must use the same names as the underlying file (which we should definitely
fix). What if you try with v5 for your read schema, but this projection schema:
{code}
{ "type": "record",
"name": "com.example.avro.compatibility.v5.CompatibilityTestRecord",
"fields": [
{ "name": "notId", "type": "string" }
] }
{code}
Also, what do you think about adding these cases as tests? I'd love to verify
this behavior and use your work to ensure that we don't have future
regressions! We also need to improve this area, and this is a great start for
figuring out what to improve. On my way to work I was just wondering in which
cases you need different read and projection schemas, and this answers it.
> Parquet-Avro does not support field removal
> -------------------------------------------
>
> Key: PARQUET-465
> URL: https://issues.apache.org/jira/browse/PARQUET-465
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.8.0
> Reporter: Thomas Omans
>
> Parquet-Avro does not support removal of fields when used with the new
> compatibility layer.
> Given a Parquet file written with Parquet-Avro at v1 and the following schema:
> {code}
> record FooBar {
> long foo;
> string bar;
> }
> {code}
> And the following configuration settings:
> {code}
> job.getConfiguration.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false)
> AvroParquetInputFormat.setAvroReadSchema(job, avroReaderSchema)
> {code}
> A job fails when trying to read it using schema version v2:
> {code}
> record FooBar {
> string bar;
> }
> {code}
> With the error:
> {code}
> org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch:
> Avro field 'foo' not found
> at
> org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:159)
> {code}
> It looks like the reader sees the field in the original version and assumes
> that the new version must expect it, but in this case the field was simply
> removed. Avro schema resolution dictates that such a field is ignored,
> since it is not relevant in the new version.
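
The resolution rule described in the report can be illustrated with a minimal Java sketch that uses only the standard library. It does not call the Avro or Parquet APIs: the class `SchemaResolutionSketch` and the method `resolve` are hypothetical names introduced only for this example, and records are modeled as plain maps. Under those assumptions it shows the expected behavior: a field that was written (`foo`) but is absent from the reader schema is ignored instead of raising an error.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaResolutionSketch {

    // Hypothetical helper (not part of Avro or parquet-avro): builds the
    // reader's view of a record by keeping only the fields the reader schema
    // names. Written fields the reader does not know about are skipped.
    static Map<String, Object> resolve(Map<String, Object> writtenRecord,
                                       List<String> readerFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (String field : readerFields) {
            if (!writtenRecord.containsKey(field)) {
                // A field the reader expects but the writer never wrote would
                // need a default value; this sketch does not model defaults.
                throw new IllegalStateException(
                        "no value for reader field '" + field + "'");
            }
            result.put(field, writtenRecord.get(field));
        }
        return result;
    }

    public static void main(String[] args) {
        // Record as written with the v1 schema: { long foo; string bar; }
        Map<String, Object> v1Record = new LinkedHashMap<>();
        v1Record.put("foo", 42L);
        v1Record.put("bar", "hello");

        // Reader uses the v2 schema: { string bar; }
        Map<String, Object> v2View = resolve(v1Record, List.of("bar"));
        System.out.println(v2View); // prints {bar=hello}
    }
}
```

The lookup is driven by the reader's field list rather than the writer's, so a removed field never triggers a "not found" failure. The failure in the report looks like the opposite direction: a lookup driven by the fields present in the file.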