[ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-23173:
--------------------------------------
    Description: 
The {{from_json}} function uses a schema to convert a string into a Spark SQL 
struct. This schema can contain non-nullable fields. The underlying 
{{JsonToStructs}} expression does not check if a resulting struct respects the 
nullability of the schema. This leads to hard-to-diagnose problems in consuming 
expressions; in our case, Parquet writing would produce an illegal Parquet file.

There are roughly two solutions here:
 # Assume that each field in the schema passed to {{from_json}} is nullable, and 
ignore the nullability information set in the passed schema.
 # Validate the object at runtime, and fail execution if the data is null 
where we are not expecting it.

I am currently slightly in favor of option 1, since it is the more performant 
option and a lot easier to implement.
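For reference, the runtime check in option 2 amounts to walking the parsed 
record alongside the schema and failing as soon as a null shows up in a 
non-nullable field. A minimal sketch in plain Python (illustrative only, not 
the actual {{JsonToStructs}} code path; the function and schema shape are 
made up for this example):

```python
# Sketch of option 2: validate a parsed record against a schema that
# carries per-field nullability, failing fast on a violation.
# Illustrative only -- not the JsonToStructs implementation.

def validate_nullability(record, schema):
    """record: dict of field name -> value.
    schema: list of (field name, nullable) pairs."""
    for name, nullable in schema:
        if record.get(name) is None and not nullable:
            raise ValueError(f"null value for non-nullable field '{name}'")
    return record

schema = [("id", False), ("comment", True)]

validate_nullability({"id": 1, "comment": None}, schema)  # ok: null is allowed
# validate_nullability({"id": None, "comment": "x"}, schema)  # raises ValueError
```

Option 1 would instead drop the check entirely and rewrite the schema so every 
field is marked nullable before parsing, which is why it is cheaper.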

WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]


> from_json can produce nulls for fields which are marked as non-nullable
> -----------------------------------------------------------------------
>
>                 Key: SPARK-23173
>                 URL: https://issues.apache.org/jira/browse/SPARK-23173
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Herman van Hovell
>            Priority: Major
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to hard-to-diagnose problems in 
> consuming expressions; in our case, Parquet writing would produce an illegal 
> Parquet file.
> There are roughly two solutions here:
>  # Assume that each field in the schema passed to {{from_json}} is nullable, 
> and ignore the nullability information set in the passed schema.
>  # Validate the object at runtime, and fail execution if the data is null 
> where we are not expecting it.
> I am currently slightly in favor of option 1, since it is the more 
> performant option and a lot easier to implement.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
