[GitHub] [spark] xkrogen opened a new pull request #31201: [SPARK-34133][AVRO] Respect case sensitivity when performing Catalyst-to-Avro field matching and enhance error messages

GitBox Fri, 15 Jan 2021 14:43:00 -0800


xkrogen opened a new pull request #31201:
URL: https://github.com/apache/spark/pull/31201



   ### What changes were proposed in this pull request?
   Make the field name matching between Avro and Catalyst schemas, on both the 
reader and writer paths, respect the global SQL settings for case sensitivity 
(i.e. case-insensitive by default). `AvroSerializer` and `AvroDeserializer` 
share a common utility in `AvroUtils` to search for an Avro field to match a 
given Catalyst field.
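   
   As a rough illustration of the matching rule described above (not the 
actual `AvroUtils` code), the sketch below models Avro fields as plain name 
strings. The object `FieldMatchSketch`, the method `findAvroField`, and the 
decision to throw when several fields match are all assumptions made for this 
example.
   ```scala
   // Hypothetical sketch: Avro fields are modeled as plain names rather than
   // org.apache.avro.Schema.Field, so the example has no external dependencies.
   object FieldMatchSketch {

     /** Finds the Avro field name that matches `catalystName`.
       *
       * Returns None when nothing matches. Throwing on multiple matches is an
       * assumption of this sketch: with case-insensitive matching, "foo" and
       * "FOO" would both match and the choice would be ambiguous.
       */
     def findAvroField(
         avroFieldNames: Seq[String],
         catalystName: String,
         caseSensitive: Boolean): Option[String] = {
       val candidates =
         if (caseSensitive) avroFieldNames.filter(_ == catalystName)
         else avroFieldNames.filter(_.equalsIgnoreCase(catalystName))

       candidates match {
         case Seq() => None
         case Seq(only) => Some(only)
         case several =>
           val names = several.mkString(", ")
           throw new IllegalArgumentException(
             s"Field '$catalystName' matches several Avro fields: $names")
       }
     }
   }
   ```
   With Avro fields `foo` and `BAR`, looking up the Catalyst field `FOO` 
yields `Some("foo")` when `caseSensitive` is false and `None` when it is true.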
   
   Improve the error messages for incompatibilities between Avro and Catalyst 
schemas. Make `AvroSerializer` more similar to `AvroDeserializer` in 
printing out contextual information such as hierarchical field names. 
Standardize exception messages in both serializer and deserializer to always 
include such contextual information, and add a top-level exception that 
shows the full schemas being parsed when the incompatibility was 
found. Both now print the hierarchical name of both the Avro and Catalyst 
fields, since the two may differ due to case sensitivity and Avro union 
handling.
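   
   The two-level error reporting can be sketched roughly as follows. This is 
not the PR's implementation: `SchemaMismatchException` stands in for 
`IncompatibleSchemaException`, the helper names are hypothetical, schemas are 
passed as pre-rendered strings, and joining path segments with `.` is an 
assumption. Only the message wording is taken from the example output in this 
description.
   ```scala
   // Hypothetical stand-in for IncompatibleSchemaException.
   final class SchemaMismatchException(message: String, cause: Throwable = null)
     extends Exception(message, cause)

   object ErrorContextSketch {

     // Assumption: hierarchical names are path segments joined with '.'.
     def hierarchicalName(path: Seq[String]): String = path.mkString(".")

     /** Builds the field-level error, naming both the Catalyst and Avro paths. */
     def fieldMismatch(
         catalystPath: Seq[String],
         avroPath: Seq[String],
         sqlType: String,
         avroType: String): SchemaMismatchException = {
       val catalystName = hierarchicalName(catalystPath)
       val avroName = hierarchicalName(avroPath)
       new SchemaMismatchException(
         s"Cannot convert Catalyst field '$catalystName' to Avro field '$avroName' " +
           s"because schema is incompatible (sqlType = $sqlType, avroType = $avroType)")
     }

     /** Runs `body` and wraps any field-level error in a top-level error that
       * reports the full schemas, keeping the original error as the cause.
       */
     def withRootContext[T](rootSqlType: String, rootAvroType: String)(body: => T): T =
       try body
       catch {
         case e: SchemaMismatchException =>
           throw new SchemaMismatchException(
             s"Cannot convert Catalyst type $rootSqlType to Avro type $rootAvroType", e)
       }
   }
   ```
   Wrapping at the root keeps the field-level detail in the `Caused by` chain 
while the outermost message shows the complete schemas.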
   
   ### Why are the changes needed?
   Spark SQL is case-insensitive by default, but `AvroSerializer` and 
`AvroDeserializer` currently match Catalyst fields against Avro fields in a 
case-sensitive manner. For example, the following write will fail:
   ```scala
         val avroSchema =
           """
             |{
             |  "type" : "record",
             |  "name" : "test_schema",
             |  "fields" : [
             |    {"name": "foo", "type": "int"},
             |    {"name": "BAR", "type": "int"}
             |  ]
             |}
         """.stripMargin
         val df = Seq((1, 3), (2, 4)).toDF("FOO", "bar")
   
         df.write.option("avroSchema", avroSchema).format("avro").save(savePath)
   ```
   
   The same is true on the read path. Assuming `testAvro` was written 
using the schema above, the following will fail to match the fields:
   ```scala
   // Reads start from the SparkSession (`spark`), not from a DataFrame.
   spark.read.schema(new StructType().add("FOO", IntegerType).add("bar", IntegerType))
     .format("avro").load(testAvro)
   ```
   
   In addition, the error messages in this type of failure scenario carry 
very little information on the write path (`AvroSerializer`). Below are two 
examples of messages that provide insufficient information to determine what 
went wrong (no field names, no context about the overall schema structure, 
etc.).
   ```
   org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type IntegerType to Avro type "float".
   
   org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(bar,IntegerType,true)) to Avro type {"type":"record","name":"test","fields":[{"name":"NOTbar","type":["null","int"],"default":null}]}.
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   When reading Avro data, or writing Avro data using the `avroSchema` option, 
field matching will respect the global SQL settings for case sensitivity.
   
   Error messages for incompatibilities between Avro and Catalyst schemas 
will be greatly improved when writing Avro data using the `avroSchema` 
option, slightly improved when reading Avro data, and much more consistent 
between the two.
   
   Below is an example of a new message. See `AvroSerdeSuite` for more examples.
   ```
   org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(foo,StructType(StructField(bar,IntegerType,true)),true)) to Avro type {"type":"record","name":"top","fields":[{"name":"foo","type":"int"}]}
        at org.apache.spark.sql.avro.AvroSerializer.liftedTree1$1(AvroSerializer.scala:83)
   ...
   Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst field 'foo' to Avro field 'foo' because schema is incompatible (sqlType = StructType(StructField(bar,IntegerType,true)), avroType = "int")
        at org.apache.spark.sql.avro.AvroSerializer.newConverter(AvroSerializer.scala:230)
   ...
   ```
   
   ### How was this patch tested?
   New tests were added to `AvroSuite` to validate the case sensitivity logic 
end to end through the SQL engine.
   
   A new unit test suite, `AvroSerdeSuite`, was added to test corner cases in 
`AvroSerializer` and `AvroDeserializer` and to verify that the exception 
messages are as expected.


