xkrogen opened a new pull request #31201:
URL: https://github.com/apache/spark/pull/31201
### What changes were proposed in this pull request?
Make the field name matching between Avro and Catalyst schemas, on both the
reader and writer paths, respect the global SQL settings for case sensitivity
(i.e. case-insensitive by default). `AvroSerializer` and `AvroDeserializer`
share a common utility in `AvroUtils` to search for an Avro field to match a
given Catalyst field.
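As a rough illustration of the matching described above, here is a minimal sketch in plain Scala. It is not the actual `AvroUtils` code: `Field`, `FieldMatching`, and `findField` are hypothetical names, a simple case class stands in for an Avro field, and failing on ambiguous matches is an assumption that the description above does not confirm.

```scala
// Hypothetical stand-in for an Avro record field (real code would use Avro's own field type).
final case class Field(name: String, pos: Int)

object FieldMatching {
  // Finds the field whose name matches `catalystName`, comparing names
  // case-sensitively or case-insensitively depending on `caseSensitive`.
  // Assumption: more than one match is treated as an error rather than
  // silently picking one of the candidates.
  def findField(
      fields: Seq[Field],
      catalystName: String,
      caseSensitive: Boolean): Option[Field] = {
    val matches =
      if (caseSensitive) fields.filter(_.name == catalystName)
      else fields.filter(_.name.equalsIgnoreCase(catalystName))
    matches match {
      case Seq() => None
      case Seq(single) => Some(single)
      case multiple =>
        val names = multiple.map(_.name).mkString("[", ", ", "]")
        throw new IllegalArgumentException(
          s"Multiple fields match '$catalystName': $names")
    }
  }
}
```

With fields named `foo` and `BAR`, looking up `FOO` finds `foo` when `caseSensitive` is `false` and finds nothing when it is `true`.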
Improve the error messages for incompatibilities between Avro and Catalyst
schemas. First, make `AvroSerializer` more similar to `AvroDeserializer` in
printing out contextual information such as hierarchical field names.
Standardize exception messages in both serializer and deserializer to always
include such contextual information, and include a top-level exception which
shows the full schemas which were being parsed when the incompatibility was
found. Both now print out the hierarchical name for both the Avro and Catalyst
fields, since they may be different due to case sensitivity and Avro union
handling.
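To make the idea of hierarchical field names concrete, the sketch below builds a message in the same shape as the new field-level error. It is only an illustration: `SchemaPaths`, `toFieldStr`, and `incompatibleMsg` are hypothetical names, and joining nested names with dots is an assumption, not something this description specifies.

```scala
object SchemaPaths {
  // Renders a path of nested field names for use in an error message.
  // Assumption: nested names are joined with '.', and an empty path
  // refers to the top-level record.
  def toFieldStr(path: Seq[String]): String =
    if (path.isEmpty) {
      "top-level record"
    } else {
      val dotted = path.mkString(".")
      s"field '$dotted'"
    }

  // Builds a message naming both the Catalyst and the Avro field, since the
  // two paths may differ (for example, in letter case).
  def incompatibleMsg(
      catalystPath: Seq[String],
      avroPath: Seq[String],
      sqlType: String,
      avroType: String): String = {
    val catalystStr = toFieldStr(catalystPath)
    val avroStr = toFieldStr(avroPath)
    s"Cannot convert Catalyst $catalystStr to Avro $avroStr because schema " +
      s"is incompatible (sqlType = $sqlType, avroType = $avroType)"
  }
}
```

Tracking the two paths separately is what allows a message to report, say, Catalyst field `FOO.bar` against Avro field `foo.BAR` when matching is case-insensitive.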
### Why are the changes needed?
Spark SQL is case-insensitive by default, but `AvroSerializer` and
`AvroDeserializer` currently match Catalyst schemas against Avro schemas in a
case-sensitive manner. For example, the following write will fail:
```scala
val avroSchema =
  """
    |{
    |  "type" : "record",
    |  "name" : "test_schema",
    |  "fields" : [
    |    {"name": "foo", "type": "int"},
    |    {"name": "BAR", "type": "int"}
    |  ]
    |}
  """.stripMargin
val df = Seq((1, 3), (2, 4)).toDF("FOO", "bar")
df.write.option("avroSchema", avroSchema).format("avro").save(savePath)
```
The same is true on the read path. Assuming `testAvro` was written using the
schema above, the following will fail to match the fields:
```scala
// Reads start from the SparkSession (`spark`), not from a DataFrame.
spark.read
  .schema(new StructType().add("FOO", IntegerType).add("bar", IntegerType))
  .format("avro")
  .load(testAvro)
```
In addition, the error messages for this type of failure carry very little
information on the write path (`AvroSerializer`). Below are two examples of
messages that do not provide enough information to determine what went wrong
(no field names, no context about the overall schema structure, etc.):
```
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type IntegerType to Avro type "float".

org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(bar,IntegerType,true)) to Avro type {"type":"record","name":"test","fields":[{"name":"NOTbar","type":["null","int"],"default":null}]}.
### Does this PR introduce _any_ user-facing change?
When reading Avro data, or writing Avro data using the `avroSchema` option,
field matching will be performed with case sensitivity respecting the global
SQL settings.
Error messages for incompatibilities between Avro and Catalyst schemas will be
greatly improved when writing Avro data using the `avroSchema` option,
slightly improved when reading Avro data, and much more consistent between the
two.
Below is an example of a new message. See `AvroSerdeSuite` for more examples.
```
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(foo,StructType(StructField(bar,IntegerType,true)),true)) to Avro type {"type":"record","name":"top","fields":[{"name":"foo","type":"int"}]}
    at org.apache.spark.sql.avro.AvroSerializer.liftedTree1$1(AvroSerializer.scala:83)
    ...
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst field 'foo' to Avro field 'foo' because schema is incompatible (sqlType = StructType(StructField(bar,IntegerType,true)), avroType = "int")
    at org.apache.spark.sql.avro.AvroSerializer.newConverter(AvroSerializer.scala:230)
    ...
```
### How was this patch tested?
New tests were added to `AvroSuite` to validate the case sensitivity logic in
an end-to-end manner through the SQL engine.
A new unit test suite, `AvroSerdeSuite`, was added to test corner cases in
`AvroSerializer` and `AvroDeserializer` and to verify that the exception
messages are as expected.