[ 
https://issues.apache.org/jira/browse/PARQUET-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116441#comment-15116441
 ] 

Thomas Omans commented on PARQUET-465:
--------------------------------------

Ryan, 

Thank you for the quick response.

Unfortunately, adding 
`AvroParquetInputFormat.setRequestedProjection(job, avroReaderSchema)` breaks 
other things. Specifically, you lose the ability to rename fields (alias lookups 
silently fail and leave `null` in the field value) and to apply default values.

I actually created a series of evolving schemas in order to see what was 
supported and what was not:

{code}
@namespace("com.example.avro.compatibility")
protocol Compatibility {

  // original record, only one field
  @namespace("com.example.avro.compatibility.v1")
  record CompatibilityTestRecord {
    long time;
  }

  // add new field with default value
  @namespace("com.example.avro.compatibility.v2")
  record CompatibilityTestRecord {
    long time;
    string id = "v2";
  }

  // reorder fields
  @namespace("com.example.avro.compatibility.v3")
  record CompatibilityTestRecord {
    string id = "v3";
    long time;
  }
  
  // alias field
  @namespace("com.example.avro.compatibility.v4")
  record CompatibilityTestRecord {
    string @aliases(["id"]) notId = "v4";
    long @aliases(["time"]) notTime;
  }

  // drop field
  @namespace("com.example.avro.compatibility.v5")
  record CompatibilityTestRecord {
    string id = "v5";
  }

}
{code}

I wrote five Parquet files, each containing one record written with one 
version of the schema, then tried to read each file with every later version: 
v2 reading v1, v3 reading v2 and v1, and so on.

Setting only the read schema (`setAvroReadSchema`) to the latest version makes 
every case pass except the final one, dropping a field. Setting both the read 
schema and the requested projection makes every case fail, because defaults 
are not applied and aliases are not respected.
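
For reference, here is a rough sketch of the record resolution rules these 
tests exercise, as I read the Avro spec: reader fields are matched by name or 
alias, a reader field missing from the written data takes its default, and 
written fields the reader does not declare are ignored. This is a simplified 
plain-Java model, not the Avro or parquet-avro code; `SchemaResolutionSketch`, 
`Field` and `resolve` are hypothetical names, and a `null` default here just 
means "no default declared".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of Avro record schema resolution; not the Avro library. */
class SchemaResolutionSketch {

    /** Hypothetical stand-in for one field of the reader schema. */
    static final class Field {
        final String name;
        final Object defaultValue;   // null means "no default declared"
        final List<String> aliases;

        Field(String name, Object defaultValue, String... aliases) {
            this.name = name;
            this.defaultValue = defaultValue;
            this.aliases = Arrays.asList(aliases);
        }
    }

    /** Resolves one written record (field name -> value) against the reader's fields. */
    static Map<String, Object> resolve(Map<String, Object> written, List<Field> reader) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Field f : reader) {
            // A reader field matches a written field by its own name or by any alias.
            List<String> candidates = new ArrayList<>();
            candidates.add(f.name);
            candidates.addAll(f.aliases);
            String match = null;
            for (String c : candidates) {
                if (written.containsKey(c)) {
                    match = c;
                    break;
                }
            }
            if (match != null) {
                out.put(f.name, written.get(match));
            } else if (f.defaultValue != null) {
                out.put(f.name, f.defaultValue);   // missing in the file: use the default
            } else {
                throw new IllegalStateException(
                    "no written value or default for reader field '" + f.name + "'");
            }
        }
        // Written fields that no reader field asked for are simply dropped.
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> v1Record = new LinkedHashMap<>();
        v1Record.put("time", 1L);
        Map<String, Object> v3Record = new LinkedHashMap<>();
        v3Record.put("id", "v3");
        v3Record.put("time", 3L);

        List<Field> v4 = Arrays.asList(
            new Field("notId", "v4", "id"),
            new Field("notTime", null, "time"));
        List<Field> v5 = Arrays.asList(new Field("id", "v5"));

        System.out.println(resolve(v1Record, v4)); // {notId=v4, notTime=1}
        System.out.println(resolve(v3Record, v4)); // {notId=v3, notTime=3}
        System.out.println(resolve(v3Record, v5)); // {id=v3}
        System.out.println(resolve(v1Record, v5)); // {id=v5}
    }
}
```

Under this model the v5 reader drops `time` without an error, which is the 
behavior I expected from the drop-field case.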

Thanks again. What is there is great; I have been reading your commits all 
day :)

> Parquet-Avro does not support field removal
> -------------------------------------------
>
>                 Key: PARQUET-465
>                 URL: https://issues.apache.org/jira/browse/PARQUET-465
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.8.0
>            Reporter: Thomas Omans
>
> Parquet-Avro does not support removing fields when used with the new 
> compatibility layer.
> Given a Parquet file written with Parquet-Avro at v1 and the following schema:
> {code}
> record FooBar {
>   long foo;
>   string bar;
> }
> {code}
> And the following configuration settings:
> {code}
> job.getConfiguration.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, false)
> AvroParquetInputFormat.setAvroReadSchema(job, avroReaderSchema)
> {code}
> A job fails when trying to read it using schema version v2:
> {code}
> record FooBar {
>   string bar;
> }
> {code}
> With the error:
> {code}
> org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'foo' not found
>       at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:159)
> {code}
> It looks like the converter sees the field in the original (written) version 
> and assumes the new version must also expect it, but in this case the field 
> was simply removed. Avro schema resolution dictates that such a field is 
> ignored, since it is not relevant in the new version.



