xkrogen opened a new pull request #31333:
URL: https://github.com/apache/spark/pull/31333


   ### What changes were proposed in this pull request?
   Improve the error messages for incompatibilities between Avro and Catalyst 
schemas. First, make `AvroSerializer` more similar to `AvroDeserializer` in 
printing out contextual information such as hierarchical field names. 
Standardize exception messages in both serializer and deserializer to always 
include such contextual information, and include a top-level exception which 
shows the full schemas which were being parsed when the incompatibility was 
found. Each now prints the hierarchical names of both the Avro and Catalyst 
fields, since the two may differ due to case sensitivity and Avro union 
handling.
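
   To illustrate the idea only, here is a minimal, dependency-free sketch of path tracking plus a top-level wrapping exception. It is not the code in this PR: the `Type` model, `pathName`, `check`, and `validate` are hypothetical names, and the local `IncompatibleSchemaException` is a stand-in for the Spark class of the same name. Real code would operate on Catalyst `DataType` and Avro `Schema` objects.

```scala
object SchemaPathSketch {
  // Local stand-in for Spark's exception class, so the sketch has no dependencies.
  final class IncompatibleSchemaException(msg: String, cause: Throwable = null)
    extends Exception(msg, cause)

  // Hypothetical miniature type model used on both the "Catalyst" and "Avro" side.
  sealed trait Type
  case object IntType extends Type
  case object FloatType extends Type
  final case class Record(fields: Seq[(String, Type)]) extends Type

  // Render a hierarchical field path such as field 'foo.bar'.
  def pathName(path: Seq[String]): String =
    if (path.isEmpty) "top-level record" else s"field '${path.mkString(".")}'"

  // Walk both types together, carrying a separate path for each side because
  // the names may differ (here: only by case).
  private def check(
      sql: Type, avro: Type, sqlPath: Seq[String], avroPath: Seq[String]): Unit =
    (sql, avro) match {
      case (Record(sqlFields), Record(avroFields)) =>
        sqlFields.foreach { case (name, sqlField) =>
          avroFields.find(_._1.equalsIgnoreCase(name)) match {
            case Some((avroName, avroField)) =>
              check(sqlField, avroField, sqlPath :+ name, avroPath :+ avroName)
            case None =>
              throw new IncompatibleSchemaException(
                s"Cannot find Catalyst ${pathName(sqlPath :+ name)} " +
                  s"in Avro ${pathName(avroPath)}")
          }
        }
      case (a, b) if a == b => () // identical leaf types are compatible
      case _ =>
        throw new IncompatibleSchemaException(
          s"Cannot convert Catalyst ${pathName(sqlPath)} to Avro " +
            s"${pathName(avroPath)} because schema is incompatible " +
            s"(sqlType = $sql, avroType = $avro)")
    }

  // Wrap any field-level failure in a top-level exception showing both full schemas.
  def validate(sql: Type, avro: Type): Unit =
    try check(sql, avro, Nil, Nil)
    catch {
      case e: IncompatibleSchemaException =>
        throw new IncompatibleSchemaException(
          s"Cannot convert Catalyst type $sql to Avro type $avro", e)
    }
}
```

   In this sketch, validating a record whose field `foo` is itself a record against one whose field `Foo` is an int throws an outer exception that shows both full schemas. Its cause names Catalyst field 'foo' and Avro field 'Foo' separately.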
   
   ### Why are the changes needed?
   The error messages for this type of failure on the write path 
(`AvroSerializer`) carry very little information. Below are two examples of 
messages that are insufficient to determine what went wrong: they lack field 
names, context about the overall schema structure, etc.
   ```
   org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert 
Catalyst type IntegerType to Avro type "float".
   
   org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert 
Catalyst type StructType(StructField(bar,IntegerType,true)) to Avro type 
{"type":"record","name":"test","fields":[{"name":"NOTbar","type":["null","int"],"default":null}]}.
   ```
   The existing error messages in `AvroDeserializer` are much better, but they 
are not consistent with one another, and they should also be consistent with 
the newly added exception messages in `AvroSerializer`.
   
   ### Does this PR introduce _any_ user-facing change?
   Error messages for incompatibilities between Avro and Catalyst schemas are 
greatly improved when writing Avro data using the `avroSchema` option, slightly 
improved when reading Avro data, and much more consistent between the two.
   
   Below is an example of a new message. See `AvroSerdeSuite` for more examples.
   ```
   org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert 
Catalyst type 
StructType(StructField(foo,StructType(StructField(bar,IntegerType,true)),true)) 
to Avro type 
{"type":"record","name":"top","fields":[{"name":"foo","type":"int"}]}
        at 
org.apache.spark.sql.avro.AvroSerializer.liftedTree1$1(AvroSerializer.scala:83)
   ...
   Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot 
convert Catalyst field 'foo' to Avro field 'foo' because schema is incompatible 
(sqlType = StructType(StructField(bar,IntegerType,true)), avroType = "int")
        at 
org.apache.spark.sql.avro.AvroSerializer.newConverter(AvroSerializer.scala:230)
   ...
   ```
   
   ### How was this patch tested?
   A new unit test suite, `AvroSerdeSuite`, was added to test corner cases in 
`AvroSerializer` and `AvroDeserializer` and to verify that the exception 
messages are as expected. Existing tests in `AvroSuite` continue to pass, with 
modifications where they made assertions about the exceptions thrown.
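
   Because the new messages are split between a top-level exception and its cause, a test typically needs to inspect the whole cause chain. The helper below is a hedged sketch of one way to do that in plain Scala. `CauseChainSketch`, `messages`, and `failureMessages` are hypothetical names and are not taken from `AvroSerdeSuite`.

```scala
object CauseChainSketch {
  // Collect the message of a throwable and of each of its causes, outermost first.
  def messages(t: Throwable): List[String] =
    Iterator.iterate(t)(_.getCause).takeWhile(_ != null).map(_.getMessage).toList

  // Evaluate `body`; return the message chain if it throws, or Nil if it succeeds.
  def failureMessages(body: => Any): List[String] =
    try { body; Nil } catch { case e: Exception => messages(e) }
}
```

   A test can then assert on the first element for the full-schema summary and on the last element for the field-level detail.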

