Herman van Hovell created SPARK-23173:
-----------------------------------------
Summary: from_json can produce nulls for fields which are marked
as non-nullable
Key: SPARK-23173
URL: https://issues.apache.org/jira/browse/SPARK-23173
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.1
Reporter: Herman van Hovell
The {{from_json}} function uses a schema to convert a string into a Spark SQL
struct. This schema can contain non-nullable fields. The underlying
{{JsonToStructs}} expression does not check if a resulting struct respects the
nullability of the schema. This leads to hard-to-diagnose problems in consuming
expressions. In our case, Parquet writing would produce an illegal Parquet file.
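To make the gap concrete, here is a minimal plain-Python model of the behavior. It is not Spark code: {{SCHEMA}}, {{parse_row}} and {{violations}} are hypothetical names invented for this illustration, and the schema is reduced to (name, nullable) pairs. The point is only that the parser never consults the nullable flag, so a consumer that trusts the flag receives a null it does not expect.

```python
import json

# Hypothetical stand-in for a schema: (field name, nullable) pairs.
SCHEMA = [("id", False), ("name", True)]

def parse_row(text, schema):
    """Lenient parse: absent or null JSON fields become None.

    The nullable flag is deliberately ignored, which models the missing check.
    """
    obj = json.loads(text)
    return {name: obj.get(name) for name, _nullable in schema}

def violations(row, schema):
    """Return the names of non-nullable fields that nevertheless hold None."""
    return [name for name, nullable in schema
            if not nullable and row[name] is None]

# "id" is declared non-nullable but is absent from the input.
row = parse_row('{"name": "a"}', SCHEMA)
bad_fields = violations(row, SCHEMA)
```

In this model {{row}} carries {{None}} for {{id}} even though the schema forbids it, and nothing fails until some downstream consumer relies on the declared nullability.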
There are roughly two solutions here:
# Assume that each field in the schema passed to {{from_json}} is nullable, and
ignore the nullability information set in the passed schema.
# Validate the object during runtime, and fail execution if the data is null
where we are not expecting this.
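Both options can be sketched in plain Python over a schema written as nested dicts. The dict layout below follows the JSON form of a Spark schema as I understand it (struct fields with {{nullable}}, arrays with {{containsNull}}, maps with {{valueContainsNull}}). The helper names {{as_nullable}} and {{check_nulls}} are hypothetical, map values are not validated to keep the sketch short, and none of this is the actual {{JsonToStructs}} implementation.

```python
def as_nullable(dt):
    """Option 1: return a copy of the schema with every nullability flag set."""
    if not isinstance(dt, dict):
        return dt  # primitive type name such as "long" or "string"
    kind = dt.get("type")
    if kind == "struct":
        return {"type": "struct",
                "fields": [dict(f, type=as_nullable(f["type"]), nullable=True)
                           for f in dt["fields"]]}
    if kind == "array":
        return dict(dt, elementType=as_nullable(dt["elementType"]),
                    containsNull=True)
    if kind == "map":
        return dict(dt, keyType=as_nullable(dt["keyType"]),
                    valueType=as_nullable(dt["valueType"]),
                    valueContainsNull=True)
    return dt

def check_nulls(value, dt, path="$"):
    """Option 2: raise ValueError if value holds None where dt forbids it."""
    if value is None or not isinstance(dt, dict):
        return
    kind = dt.get("type")
    if kind == "struct":
        for f in dt["fields"]:
            child = value.get(f["name"])
            child_path = path + "." + f["name"]
            if child is None and not f["nullable"]:
                raise ValueError("null in non-nullable field " + child_path)
            check_nulls(child, f["type"], child_path)
    elif kind == "array":
        for i, item in enumerate(value):
            item_path = "%s[%d]" % (path, i)
            if item is None and not dt["containsNull"]:
                raise ValueError("null in non-nullable element " + item_path)
            check_nulls(item, dt["elementType"], item_path)

schema = {"type": "struct", "fields": [
    {"name": "id", "type": "long", "nullable": False, "metadata": {}},
    {"name": "tags", "nullable": True, "metadata": {},
     "type": {"type": "array", "elementType": "string", "containsNull": False}},
]}
row = {"id": None, "tags": ["x", None]}

relaxed = as_nullable(schema)   # option 1: one-time schema rewrite
check_nulls(row, relaxed)       # passes: everything is nullable now

try:
    check_nulls(row, schema)    # option 2: per-row validation
    error = None
except ValueError as exc:
    error = str(exc)
```

The sketch also shows where the cost difference comes from: option 1 rewrites the schema once, while option 2 has to walk every parsed value.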
I am currently slightly in favor of option 1, since it is the more performant
option and a lot easier to do. WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon]
[~brkyvz]