[
https://issues.apache.org/jira/browse/PARQUET-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17357463#comment-17357463
]
Philip Wilcox edited comment on PARQUET-2055 at 6/4/21, 4:01 PM:
-----------------------------------------------------------------
[~gszadovszky] that makes sense, thank you for that link! I understand the
situation much better now than when I first opened this ticket. What I was
really after was a feature request, "Support for following Avro schema
evolution rules when reading Parquet back into newer Avro schemas", which is
very similar to PARQUET-465.
The context is very useful! Yeah, renaming is tough if you allow addition and
removal. I still don't fully understand how the projection capabilities of
parquet-mr differ from the Avro schema evolution rules, because in terms of
the conversion to Avro, this seems to conflict with standard Avro behavior.
Renaming fields is not generally supported in Avro schema evolution (outside
of aliases). For instance, the Schema Resolution section of that Avro spec doc
specifies that "the ordering of fields may be different: fields are matched by
name" when reading data into a different schema.
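To make the name-matching rule concrete, here is a minimal plain-Java sketch of how I read that part of the spec. It does not use Avro's API at all: `NameMatchSketch`, `Field`, and `match` are hypothetical names made up for illustration, and the only rules modeled are "match by name" and "fall back to the reader field's aliases".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameMatchSketch {
    // Hypothetical stand-in for a reader-schema field: a name plus optional aliases.
    public static final class Field {
        final String name;
        final List<String> aliases;

        public Field(String name, String... aliases) {
            this.name = name;
            this.aliases = Arrays.asList(aliases);
        }
    }

    // For each reader field, return the position of the writer field it resolves to:
    // first by exact name, then by any reader-side alias; -1 when nothing matches.
    public static Map<String, Integer> match(List<String> writerFields, List<Field> readerFields) {
        Map<String, Integer> writerIndex = new HashMap<>();
        for (int i = 0; i < writerFields.size(); i++) {
            writerIndex.put(writerFields.get(i), i);
        }
        Map<String, Integer> resolved = new LinkedHashMap<>();
        for (Field field : readerFields) {
            Integer index = writerIndex.get(field.name);
            if (index == null) {
                for (String alias : field.aliases) {
                    index = writerIndex.get(alias);
                    if (index != null) {
                        break;
                    }
                }
            }
            resolved.put(field.name, index == null ? -1 : index);
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Writer order is (name, id); the reader reorders, renames via an alias, and adds a field.
        List<String> writer = Arrays.asList("name", "id");
        List<Field> reader = Arrays.asList(
                new Field("id"),
                new Field("fullName", "name"),
                new Field("email"));
        System.out.println(match(writer, reader)); // prints {id=1, fullName=0, email=-1}
    }
}
```

The point of the sketch is that field position never decides the match, so a rename without an alias simply looks like one field removed and another added.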
However, by this point others probably depend on the current behavior of the
Parquet read projection.
Is "read with support for Avro schema evolution" something that would make
sense as a new feature? The projection code seems tantalizingly close to what
would be needed for this, though I'm not familiar enough with the library to
know what else would need to change or what might break.
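For clarity, this is roughly the read behavior I have in mind, written as a plain-Java sketch rather than against parquet-avro or Avro itself. `EvolvedReadSketch`, `ReaderField`, and `read` are hypothetical names, and records are modeled as simple maps; the rules are my reading of the Avro Schema Resolution section: a writer-only field is ignored, a reader-only field takes its default, and a reader-only field without a default is an error.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EvolvedReadSketch {
    // Hypothetical reader-side field. hasDefault is tracked separately because
    // null is itself a legal default value.
    public static final class ReaderField {
        final String name;
        final boolean hasDefault;
        final Object defaultValue;

        private ReaderField(String name, boolean hasDefault, Object defaultValue) {
            this.name = name;
            this.hasDefault = hasDefault;
            this.defaultValue = defaultValue;
        }

        public static ReaderField required(String name) {
            return new ReaderField(name, false, null);
        }

        public static ReaderField withDefault(String name, Object defaultValue) {
            return new ReaderField(name, true, defaultValue);
        }
    }

    // Build the record as the reader schema sees it from a record written with an older schema.
    public static Map<String, Object> read(Map<String, Object> written, List<ReaderField> readerSchema) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (ReaderField field : readerSchema) {
            if (written.containsKey(field.name)) {
                result.put(field.name, written.get(field.name)); // present in both schemas
            } else if (field.hasDefault) {
                result.put(field.name, field.defaultValue);      // added field: use the reader default
            } else {
                throw new IllegalStateException(
                        "Reader field '" + field.name + "' is missing from the written data and has no default");
            }
        }
        return result; // fields only the writer knew about are never copied
    }

    public static void main(String[] args) {
        Map<String, Object> oldRecord = new LinkedHashMap<>();
        oldRecord.put("id", 7);
        oldRecord.put("legacy", "dropped by the new schema");

        List<ReaderField> newSchema = Arrays.asList(
                ReaderField.required("id"),
                ReaderField.withDefault("email", null));
        System.out.println(read(oldRecord, newSchema)); // prints {id=7, email=null}
    }
}
```

This covers both directions at once (adding an optional field and removing an old one), which is why just flipping the variable names in the existing code would not be enough.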
> Schema mismatch for reading Avro from parquet file with old schema version?
> ---------------------------------------------------------------------------
>
> Key: PARQUET-2055
> URL: https://issues.apache.org/jira/browse/PARQUET-2055
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.11.0
> Environment: Linux, Apache Beam 2.28.0, Java 11
> Reporter: Philip Wilcox
> Priority: Minor
>
> I ran into what looks like a bug in the Parquet Avro reading code when
> reading a file written with a previous version of a schema using a new,
> evolved version of that schema.
> I'm using Apache Beam's ParquetIO library, which supports passing in schemas
> to use for "projection" and I was investigating if that would work for me
> here. However, it didn't work, complaining that my new reader schema had a
> field that wasn't in the writer schema.
>
> I traced this through to a couple places in the parquet-avro code that don't
> look right to me:
>
> First, in `prepareForRead` here:
> [https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java#L116]
> The `parquetSchema` var comes from `parquetSchema =
> readContext.getRequestedSchema();` while the `avroSchema` var comes from the
> parquet file itself with `avroSchema = new
> Schema.Parser().parse(keyValueMetaData.get(AVRO_SCHEMA_METADATA_KEY));`
> I can verify that `parquetSchema` is the schema I'm requesting it be
> projected to and that `avroSchema` is the schema from the file, but the
> naming looks backward. Shouldn't `parquetSchema` be the one from the parquet
> file?
> Following the stack down, I was hitting this line:
> https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L91
> Here it was failing because the `avroSchema` didn't have a field that was in
> the `parquetSchema`, with the variables assigned in the same way as above.
> That's exactly the case I was hoping to use this projection for, though:
> getting the record read with the new reader schema, using the default value
> from the new schema for the new field. In fact, the comment on line 101 "store defaults
> for any new Avro fields from avroSchema that are not in the writer schema
> (parquetSchema)" suggests that the intent was for this to work, but the
> actual code has the writer schema in avroSchema and the reader schema in
> parquetSchema.
> (Additionally, I'd want this to support schema evolution both for adding an
> optional field and also removing an old field - so just flipping the names
> around would result in this still breaking if the reader schema dropped a
> field from the writer schema...)
> Looking to understand if I'm interpreting this correctly, or if there's
> another path that's intended to be used.
> Thank you!