[GitHub] [spark] xkrogen opened a new pull request #34009: SPARK-34378 [AVRO] Enhance AvroSerializer validation to allow extra nullable Avro fields

GitBox Wed, 15 Sep 2021 15:12:56 -0700


xkrogen opened a new pull request #34009:
URL: https://github.com/apache/spark/pull/34009



   ### What changes were proposed in this pull request?
   Loosen the schema validation logic in `AvroSerializer` to accommodate the 
situation where a user has provided an explicit schema (via `avroSchema`) and 
this schema has extra fields which are not present in the Catalyst schema (that 
of the DataFrame being written). Specifically, extra _nullable_ fields will be 
allowed and populated as null. _Required_ (non-nullable) fields will still be 
checked for existence.
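
The rule above can be sketched in plain Scala. This is only an illustration under simplifying assumptions, not the code in this PR: `AvroField`, `LooseValidationSketch`, `validate`, and `toRecord` are hypothetical names, schemas are flattened to top-level fields, and field names are matched exactly.

```scala
// Hypothetical stand-in for an Avro field; NOT a Spark or Avro class.
final case class AvroField(name: String, nullable: Boolean)

object LooseValidationSketch {
  // Checks the Avro fields against the column names of the data being written.
  // Left: error naming required Avro fields that the data does not provide.
  // Right: names of extra nullable Avro fields, which will be written as null.
  def validate(
      avroFields: Seq[AvroField],
      catalystFieldNames: Set[String]): Either[String, Seq[String]] = {
    // Fields present in the Avro schema but absent from the data being written.
    val extra = avroFields.filterNot(f => catalystFieldNames.contains(f.name))
    val (nullableExtra, requiredExtra) = extra.partition(_.nullable)
    if (requiredExtra.nonEmpty) {
      Left("Required Avro fields missing from Catalyst schema: " +
        requiredExtra.map(_.name).mkString(", "))
    } else {
      Right(nullableExtra.map(_.name))
    }
  }

  // Builds an output record with one entry per Avro field, using null for
  // any field the input row does not supply.
  def toRecord(row: Map[String, Any], avroFields: Seq[AvroField]): Map[String, Any] =
    avroFields.map(f => f.name -> row.getOrElse(f.name, null)).toMap
}
```

With a required `id` field and a nullable `note` field, validating against data that only has `id` succeeds and reports `note` as a field to null-fill. Validating against data that only has `note` fails, because the required `id` field is missing.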
   
   ### Why are the changes needed?
   It's common for Avro schemas to evolve in a _compatible_ way (as discussed 
in Confluent's documentation on [Schema Evolution and 
Compatibility](https://docs.confluent.io/platform/current/schema-registry/avro.html);
 here I refer to `FULL` compatibility). Under such a scenario, new _optional_ 
fields are added to a schema. Producers are free to include the new field if 
they so choose, and consumers are free to read the new field if they so choose. 
It is optional on both sides.
   
   Consider the following code:
   ```scala
   val outputSchema = getOutputSchema()
   df.write.format("avro").option("avroSchema", outputSchema).save(...)
   ```
   If your schemas are managed in a centralized 
repository (e.g. a [schema 
registry](https://docs.confluent.io/platform/current/schema-registry/index.html)),
 `outputSchema` may be updated at some point to add a new optional field, 
without any action on your side as a data producer. With the 
current code, this breaks the producer job, because validation 
complains that the newly added field is not present in the DataFrame. 
The producer should be able to continue producing data as normal 
without adding the new field to the DataFrame it writes out, because the 
field is optional.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, when using the `avroSchema` option on the Avro data source during 
writes, validation is less strict, and allows for (compatible) schema evolution 
to be handled more gracefully.
   
   ### How was this patch tested?
   New unit tests added. We've also been employing this logic internally for a 
few years, though the implementation was quite different due to recent changes 
in this area of the code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


