xkrogen opened a new pull request #31333:
URL: https://github.com/apache/spark/pull/31333
### What changes were proposed in this pull request?
Improve the error messages for incompatibilities between Avro and Catalyst
schemas. First, make `AvroSerializer` more similar to `AvroDeserializer` by
printing contextual information such as hierarchical field names. Then
standardize exception messages in both the serializer and the deserializer so
that they always include this contextual information, and add a top-level
exception that shows the full schemas being parsed when the incompatibility
was found. Both now print the hierarchical name of both the Avro field and the
Catalyst field, since the two may differ due to case sensitivity and Avro
union handling.
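The pattern described above (track both hierarchical names while recursing,
then wrap the nested failure in a top-level exception that carries the full
schemas) can be sketched as follows. This is only an illustration in plain
Python, not the code in this PR: the toy schema representation (nested dicts
of type names), the `IncompatibleSchemaError` class, and the functions
`_check` and `check_compatible` are all hypothetical, and the case-insensitive
field matching is an assumption made to show why the two names can differ.

```python
class IncompatibleSchemaError(Exception):
    """Hypothetical stand-in for an incompatible-schema exception."""


def _check(sql_type, avro_type, sql_path, avro_path):
    """Recursively compare two toy schemas, tracking both hierarchical paths."""
    sql_name = ".".join(sql_path) or "<top-level>"
    avro_name = ".".join(avro_path) or "<top-level>"
    if isinstance(sql_type, dict) and isinstance(avro_type, dict):
        # Assumption: fields are matched case-insensitively, so the
        # Catalyst-side and Avro-side names of the same field may differ.
        avro_by_lower = {name.lower(): name for name in avro_type}
        for sql_field, sql_child in sql_type.items():
            avro_field = avro_by_lower.get(sql_field.lower())
            if avro_field is None:
                missing = ".".join(sql_path + [sql_field])
                raise IncompatibleSchemaError(
                    f"Cannot find Catalyst field '{missing}' "
                    f"in Avro record '{avro_name}'"
                )
            _check(sql_child, avro_type[avro_field],
                   sql_path + [sql_field], avro_path + [avro_field])
    elif sql_type != avro_type:
        # Leaf mismatch, or a struct paired with a non-struct.
        raise IncompatibleSchemaError(
            f"Cannot convert Catalyst field '{sql_name}' to Avro field "
            f"'{avro_name}' because schema is incompatible "
            f"(sqlType = {sql_type!r}, avroType = {avro_type!r})"
        )


def check_compatible(sql_schema, avro_schema):
    """Wrap any nested failure in a top-level error showing both full schemas."""
    try:
        _check(sql_schema, avro_schema, [], [])
    except IncompatibleSchemaError as cause:
        raise IncompatibleSchemaError(
            f"Cannot convert Catalyst type {sql_schema!r} "
            f"to Avro type {avro_schema!r}"
        ) from cause


if __name__ == "__main__":
    try:
        check_compatible({"foo": {"bar": "int"}}, {"FOO": {"bar": "float"}})
    except IncompatibleSchemaError as err:
        print(err)                          # top-level message with full schemas
        print("Caused by:", err.__cause__)  # nested message with both field paths
```

The nested message names the field as `foo.bar` on the Catalyst side and
`FOO.bar` on the Avro side, which is the kind of discrepancy that printing
both hierarchical names is meant to make visible.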
### Why are the changes needed?
On the write path (`AvroSerializer`), the error messages for this type of
failure carry very little information. Below are two examples of messages that
do not provide enough information to determine what went wrong (they lack
field names, context about the overall schema structure, etc.).
```
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type IntegerType to Avro type "float".
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(bar,IntegerType,true)) to Avro type {"type":"record","name":"test","fields":[{"name":"NOTbar","type":["null","int"],"default":null}]}.
```
The existing error messages in `AvroDeserializer` are much better, but they
are not internally consistent, and they should also be consistent with the
newly added exception messages in `AvroSerializer`.
### Does this PR introduce _any_ user-facing change?
Error messages for incompatibilities between Avro and Catalyst schemas will be
greatly improved when writing Avro data using the `avroSchema` option,
slightly improved when reading Avro data, and much more consistent between the
two paths.
Below is an example of a new message. See `AvroSerdeSuite` for more examples.
```
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(foo,StructType(StructField(bar,IntegerType,true)),true)) to Avro type {"type":"record","name":"top","fields":[{"name":"foo","type":"int"}]}
    at org.apache.spark.sql.avro.AvroSerializer.liftedTree1$1(AvroSerializer.scala:83)
    ...
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst field 'foo' to Avro field 'foo' because schema is incompatible (sqlType = StructType(StructField(bar,IntegerType,true)), avroType = "int")
    at org.apache.spark.sql.avro.AvroSerializer.newConverter(AvroSerializer.scala:230)
    ...
```
### How was this patch tested?
A new unit test suite, `AvroSerdeSuite`, was added to test corner cases in
`AvroSerializer` and `AvroDeserializer` and to verify that the exception
messages are as expected. Existing tests in `AvroSuite` continue to pass, with
modifications where they asserted on the exceptions that would be thrown.