xkrogen opened a new pull request #31490:
URL: https://github.com/apache/spark/pull/31490
### What changes were proposed in this pull request?
Provide the (configurable) ability to perform Avro-to-Catalyst schema field
matching using the position of the fields instead of their names. A new
`option` is added for the Avro datasource, `schemaRenaming`, which instructs
`AvroSerializer`/`AvroDeserializer` to perform positional field matching
instead of matching by name.
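
To illustrate the difference between the two matching modes, here is a minimal, self-contained sketch in plain Python. It is not the code in `AvroSerializer`/`AvroDeserializer`; the `match_fields` helper and the `(name, type)` tuple representation of a schema are hypothetical and exist only for illustration. It assumes flat schemas and exact type equality, which is a simplification of what real schema matching has to handle.

```python
from typing import List, Tuple

# Hypothetical, simplified schema representation: a list of (field name, type name).
Field = Tuple[str, str]


def match_fields(catalyst: List[Field], avro: List[Field],
                 positional: bool) -> List[Tuple[str, str]]:
    """Pair each Catalyst field with an Avro field.

    Returns a list of (catalyst_field_name, avro_field_name) pairs.
    Raises ValueError if the schemas cannot be matched in the chosen mode.
    """
    if positional:
        # Positional matching: the i-th Catalyst field maps to the i-th Avro
        # field, regardless of what the fields are called.
        if len(catalyst) != len(avro):
            raise ValueError("field count mismatch")
        pairs = list(zip(catalyst, avro))
    else:
        # By-name matching: field order is irrelevant, names must agree.
        avro_by_name = {name: (name, typ) for name, typ in avro}
        pairs = []
        for name, typ in catalyst:
            if name not in avro_by_name:
                raise ValueError(f"no Avro field named {name!r}")
            pairs.append(((name, typ), avro_by_name[name]))

    # In both modes the paired fields must have compatible types
    # (simplified here to exact equality of the type name).
    for (c_name, c_type), (a_name, a_type) in pairs:
        if c_type != a_type:
            raise ValueError(
                f"type mismatch: {c_name} ({c_type}) vs {a_name} ({a_type})")
    return [(c[0], a[0]) for c, a in pairs]


catalyst_schema = [("id", "long"), ("name", "string")]
renamed_avro_schema = [("user_id", "long"), ("user_name", "string")]

# Same structure, different names: only positional matching succeeds.
print(match_fields(catalyst_schema, renamed_avro_schema, positional=True))
```

With these inputs, positional matching pairs `id` with `user_id` and `name` with `user_name`, while by-name matching would raise because no Avro field is called `id`. Conversely, two schemas with the same names in a different order match by name but can fail positionally on a type mismatch.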
### Why are the changes needed?
This by-name matching is fairly recent; prior to PR #24635, at least on
the write path, schemas were matched positionally ("structural" comparison).
While by-name matching is the better default, users should be able to configure
this behavior. Even when PR #24635 was under review, there was
[interest in making this behavior
configurable](https://github.com/apache/spark/pull/24635#issuecomment-494205251),
but it appears to have gone unaddressed.
There is precedent for making this behavior configurable: PR #29737 added
this support for ORC. Beyond this precedent, Hive performs matching
positionally
([ref](https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles)),
so this is behavior that Hadoop/Hive ecosystem users are familiar with.
### Does this PR introduce _any_ user-facing change?
Yes, a new option is provided for the Avro datasource, `schemaRenaming`,
which provides compatibility with Hive and pre-3.0.0 Spark behavior.
### How was this patch tested?
New unit tests are added within `AvroSuite` and `AvroSerdeSuite`, and most
of the existing tests within `AvroSerdeSuite` are adapted to run the same
test with both by-name and positional matching to ensure feature parity.