xuanyuanking commented on a change in pull request #24682: [SPARK-27838][SQL] 
Support user provided non-nullable avro schema for nullable catalyst schema 
without any null record
URL: https://github.com/apache/spark/pull/24682#discussion_r376970029
 
 

 ##########
 File path: docs/sql-migration-guide-upgrade.md
 ##########
 @@ -132,6 +132,10 @@ license: |
 
   - Since Spark 3.0, Spark will cast `String` to `Date/TimeStamp` in binary 
comparisons with dates/timestamps. The previous behaviour of casting 
`Date/Timestamp` to `String` can be restored by setting 
`spark.sql.legacy.typeCoercion.datetimeToString` to `true`.
 
+  - Since Spark 3.0, when Avro files are written with user provided schema, 
the fields will be matched by field names between catalyst schema and avro 
schema instead of positions.
+
+  - Since Spark 3.0, when Avro files are written with user provided 
non-nullable schema, even the catalyst schema is nullable, Spark is still able 
to write the files. However, Spark will throw runtime NPE if any of the records 
contains null.
 
 Review comment:
   After further investigation, I think the 3.0 behavior here is good enough, 
because it only takes effect when no records contain null. Also, when 
any record does contain null, the new approach gives a clearer exception.
   
   Let's use the UT added in this PR to illustrate:
   For the old behavior, we get the exception 
`org.apache.avro.AvroRuntimeException: Not a union: "int"`, which doesn't 
indicate that the problem is null data being inserted.
   For the current approach, the error message will be `NullPointerException: 
in test_schema in string null of string in field Name of test_schema`.
   
   So if both legacy and non-legacy mode throw an exception, and the legacy mode 
message is less clear than the new one, the legacy config might not be necessary. 
WDYT?
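
To make the comparison concrete, here is a toy Python model of the behavior described above. It is only a sketch, not Spark's or Avro's code: `AvroField`, `write_record` and `NullInNonNullableField` are hypothetical names, and the error text is invented for illustration. It assumes fields are matched by name and that nullability is checked per record at write time.

```python
# Toy model only: not Spark's or Avro's actual implementation.
# AvroField, write_record and NullInNonNullableField are hypothetical names.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class AvroField:
    name: str
    nullable: bool  # False models a plain type, True models a union with "null"


class NullInNonNullableField(Exception):
    """Raised when a null value reaches a field declared non-nullable."""


def write_record(schema: List[AvroField], row: Dict[str, Any]) -> Dict[str, Any]:
    """Serialize one row against the user-provided schema, matching fields by name."""
    out = {}
    for field in schema:
        value = row.get(field.name)
        if value is None and not field.nullable:
            # Fail per record, and name the offending field in the message.
            raise NullInNonNullableField(
                f"null value in non-nullable field {field.name}"
            )
        out[field.name] = value
    return out


schema = [AvroField("Name", nullable=False), AvroField("Age", nullable=True)]

# No nulls in the non-nullable field: every record is written, even though
# the source column is allowed to hold nulls.
ok = [
    write_record(schema, r)
    for r in [{"Name": "a", "Age": 1}, {"Name": "b", "Age": None}]
]

# A null in the non-nullable field fails at write time, naming the field.
try:
    write_record(schema, {"Name": None, "Age": 3})
    error = None
except NullInNonNullableField as exc:
    error = str(exc)
```

In this model the write only fails on the record that actually carries a null, and the message points at the field, which is the property argued for above.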

