Hyukjin Kwon created SPARK-16472:
------------------------------------

             Summary: Inconsistent nullability in schema after being read in SQL API.
                 Key: SPARK-16472
                 URL: https://issues.apache.org/jira/browse/SPARK-16472
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Hyukjin Kwon


It seems the data sources implementing {{FileFormat}} load the data by forcing all fields to be nullable. It seems this was officially documented in SPARK-11360 and was discussed here:
https://www.mail-archive.com/[email protected]/msg39230.html

However, I realised that several APIs do not follow this. For example,

{code}
DataFrameReader.json(jsonRDD: RDD[String])
{code}

So, the code below:

{code}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
val df = spark.read.schema(schema).json(rdd)
df.printSchema()
df.printSchema()
{code}

prints the following:

{code}
root
 |-- a: integer (nullable = false)
{code}

This API keeps the user-specified schema as it is after loading. However, the schema becomes different (all fields are forced to nullable) when the data is loaded via either of the APIs below:

{code}
spark.read.format("json").schema(...).load(path).printSchema()
{code}

{code}
spark.read.schema(...).load(path).printSchema()
{code}

Both produce the following:

{code}
root
 |-- a: integer (nullable = true)
{code}
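
For reference, below is a minimal end-to-end sketch of the file-based path (the path and data are hypothetical, used only for illustration): it writes JSON lines to a directory and reads them back with an explicitly non-nullable schema.

{code}
import spark.implicits._
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical path, used only for this sketch.
val path = "/tmp/spark-16472-json"
Seq("""{"a" : 1}""", """{"a" : null}""").toDF("value").write.text(path)

val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)

// Although "a" is declared non-nullable above, the file-based
// read reports it as nullable.
spark.read.schema(schema).json(path).printSchema()
{code}

which prints {{nullable = true}} as in the output above.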


In addition, this happens with structured streaming as well (even when we read the data back as a batch after writing it via structured streaming).
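
A minimal sketch of the streaming case (all paths are hypothetical, and {{inputDir}} is assumed to already contain JSON lines matching the schema): it ingests JSON with a non-nullable schema, writes the stream out as Parquet, and reads the result back as a batch.

{code}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical paths, used only for this sketch; inputDir is assumed
// to already contain JSON lines such as {"a" : 1}.
val inputDir = "/tmp/spark-16472-stream-in"
val outputDir = "/tmp/spark-16472-stream-out"
val checkpointDir = "/tmp/spark-16472-checkpoint"

val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)

val query = spark.readStream
  .schema(schema)
  .json(inputDir)
  .writeStream
  .format("parquet")
  .option("checkpointLocation", checkpointDir)
  .start(outputDir)

query.processAllAvailable()
query.stop()

// Reading the streamed output back as a batch also reports
// the field as nullable.
spark.read.parquet(outputDir).printSchema()
{code}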

While testing, I wrote some test code and patches. Please see the following PR for more cases.


