Gianluca Amori created SPARK-27506:
--------------------------------------

             Summary: Function `from_avro` doesn't allow deserialization of 
data using other compatible schemas
                 Key: SPARK-27506
                 URL: https://issues.apache.org/jira/browse/SPARK-27506
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.1
            Reporter: Gianluca Amori


 SPARK-24768 and its subtasks introduced support for reading and writing Avro 
data by parsing a binary column in Avro format and converting it into the 
corresponding Catalyst value (and vice versa).

 

The current implementation has the limitation that an event can only be 
deserialized with the exact schema it was serialized with. This breaks one of 
Avro's most important features, schema evolution 
[https://docs.confluent.io/current/schema-registry/avro.html] - most 
importantly, the ability to read old data with a newer (compatible) schema 
without breaking the consumer.
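As a minimal sketch of the schema evolution being broken here (plain Avro, no Spark; the Event schema and field names are illustrative): data written with an old schema can be read with a newer compatible schema that adds a defaulted field.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Writer's schema: the schema the event was serialized with.
val writerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"long"}
    |]}""".stripMargin)

// Reader's schema: a newer, compatible schema adding a field with a default.
val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"source","type":"string","default":"unknown"}
    |]}""".stripMargin)

// Serialize a record with the old schema...
val record = new GenericData.Record(writerSchema)
record.put("id", 42L)
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
encoder.flush()

// ...and deserialize it with the new one: Avro resolves the two schemas,
// filling the missing field from its default.
val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
val decoded = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  .read(null, decoder)
```

This is exactly the resolution step that from_avro cannot perform today, because it receives only a single schema.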

 

The GenericDatumReader in the Avro library already supports passing an optional 
*writer's schema* (the schema the record was serialized with) alongside the 
mandatory *reader's schema* (the schema the record is to be deserialized 
with). The proposed change is to do the same in the from_avro function, 
allowing an optional writer's schema to be passed and used during 
deserialization.
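A rough sketch of how the call site could look. The three-argument overload below is hypothetical - it is the proposal of this ticket, not an existing API - and the df/schema variable names are illustrative:

```scala
import org.apache.spark.sql.avro.from_avro
import org.apache.spark.sql.functions.col

// Today: the single schema is used as both writer's and reader's schema,
// so the binary data must match it exactly.
df.select(from_avro(col("value"), readerSchemaJson).as("event"))

// Proposed (hypothetical overload): an optional writer's schema, mirroring
// GenericDatumReader(writerSchema, readerSchema), so old data can be read
// with a newer compatible schema.
df.select(from_avro(col("value"), readerSchemaJson, writerSchemaJson).as("event"))
```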



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
