xkrogen opened a new pull request #34009: URL: https://github.com/apache/spark/pull/34009
### What changes were proposed in this pull request?

Loosen the schema validation logic in `AvroSerializer` to accommodate the situation where a user has provided an explicit schema (via `avroSchema`) and this schema has extra fields which are not present in the Catalyst schema (the DF being written). Specifically, extra _nullable_ fields will be allowed and populated as null. _Required_ fields (non-null) will still be checked for existence.

### Why are the changes needed?

It's common for Avro schemas to evolve in a _compatible_ way (as discussed in Confluent's documentation on [Schema Evolution and Compatibility](https://docs.confluent.io/platform/current/schema-registry/avro.html); here I refer to `FULL` compatibility). Under such a scenario, new _optional_ fields are added to a schema. Producers are free to include the new field if they so choose, and consumers are free to read the new field if they so choose. It is optional on both sides.

Consider the following code:

```
val outputSchema = getOutputSchema()
df.write.format("avro").option("avroSchema", outputSchema).save(...)
```

If you have a situation where schemas are managed in some centralized repository (e.g. a [schema registry](https://docs.confluent.io/platform/current/schema-registry/index.html)), `outputSchema` may update at some point to add a new optional field, without you necessarily initiating any action on your side as a data producer. With the current code, this would cause the producer job to break, because validation would complain that the newly added field is not present in the DataFrame. Really, the producer should be able to continue producing data as normal even without adding the new field to the DataFrame it is writing out, because the field is optional.

### Does this PR introduce _any_ user-facing change?

Yes, when using the `avroSchema` option on the Avro data source during writes, validation is less strict, and allows for (compatible) schema evolution to be handled more gracefully.
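
To make the intended rule concrete, here is a minimal sketch of the loosened check. It is not the actual `AvroSerializer` code: `AvroField`, `LooseSchemaValidation`, and `validate` are hypothetical names, and the Avro schema is reduced to field names plus a nullability flag. Only the case described above is modeled (Avro fields missing from the Catalyst schema).

```scala
// Hypothetical, simplified model of an Avro record field (not an Avro or Spark type).
final case class AvroField(name: String, nullable: Boolean)

object LooseSchemaValidation {
  /**
   * Checks the user-provided Avro fields against the Catalyst field names.
   * Returns Right(names of extra nullable Avro fields, to be written as null)
   * or Left(error message) if a required Avro field has no Catalyst counterpart.
   */
  def validate(
      catalystFields: Seq[String],
      avroFields: Seq[AvroField]): Either[String, Seq[String]] = {
    val present = catalystFields.toSet
    // Avro fields that the DataFrame being written does not provide.
    val missing = avroFields.filterNot(f => present.contains(f.name))
    // Required (non-null) fields must still exist; nullable ones may be absent.
    val (required, optional) = missing.partition(f => !f.nullable)
    if (required.nonEmpty) {
      Left("Required Avro fields missing from Catalyst schema: " +
        required.map(_.name).mkString(", "))
    } else {
      Right(optional.map(_.name))
    }
  }
}
```

Under this sketch, a schema that gains a nullable field keeps validating against an unchanged DataFrame, while dropping a required field from the DataFrame still fails.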
### How was this patch tested?

New unit tests added. We've also been employing this logic internally for a few years, though the implementation was quite different due to recent changes in this area of the code.
