[ 
https://issues.apache.org/jira/browse/SPARK-16472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-16472:
---------------------------------
    Summary: Inconsistent nullability in schema after being read  (was: 
Inconsistent nullability in schema after being read in SQL API.)

> Inconsistent nullability in schema after being read
> ---------------------------------------------------
>
>                 Key: SPARK-16472
>                 URL: https://issues.apache.org/jira/browse/SPARK-16472
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> It seems that the data sources implementing {{FileFormat}} load the data 
> by forcing all fields to be nullable. It seems this was officially 
> documented in SPARK-11360 and was discussed here:
> https://www.mail-archive.com/[email protected]/msg39230.html
> However, I realised that several APIs do not follow this. For example,
> {code}
> DataFrameReader.json(jsonRDD: RDD[String])
> {code}
> So, the code below:
> {code}
> import org.apache.spark.sql.types._
> val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
> val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
> val df = spark.read.schema(schema).json(rdd)
> df.printSchema()
> df.printSchema()
> {code}
> prints the following:
> {code}
> root
>  |-- a: integer (nullable = false)
> {code}
> This API keeps the given schema as it is after loading. However, the schema 
> becomes different (all fields nullable) when loading through the APIs below:
> {code}
> spark.read.format("json").schema(...).load(path).printSchema()
> {code}
> {code}
> spark.read.schema(...).load(path).printSchema()
> {code}
> Both produce the following:
> {code}
> root
>  |-- a: integer (nullable = true)
> {code}
> In addition, this happens for structured streaming as well (even when 
> we read in batch after writing with structured streaming).
> While testing, I wrote some test code and patches. Please see the following 
> PR for more cases.
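
As an illustration of the difference described above, here is a minimal sketch of what "forcing the fields as nullable" could look like at the schema level. It is not Spark's internal implementation: `NullableSchemaSketch` and `forceNullable` are hypothetical names introduced only for this example, the helper relaxes top-level fields only, and it assumes spark-sql is on the classpath. Only the public `StructType` and `StructField` types are used.

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object NullableSchemaSketch {
  // Hypothetical helper (not a Spark API): returns a copy of the schema with
  // every top-level field marked nullable. Nested fields are left untouched.
  def forceNullable(schema: StructType): StructType =
    StructType(schema.fields.map(_.copy(nullable = true)))

  def main(args: Array[String]): Unit = {
    // Same schema as in the report: a single non-nullable integer field.
    val declared = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
    val relaxed = forceNullable(declared)

    // Schema as the user declared it (what the RDD-based json API keeps).
    println(declared.treeString)
    // Schema with nullability relaxed (what the file-based load path reports).
    println(relaxed.treeString)
  }
}
```

Comparing the two printed trees mirrors the two `printSchema()` outputs quoted in the description: the field name and type are identical, and only the `nullable` flag differs.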



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)