Hyukjin Kwon created SPARK-16472:
------------------------------------
Summary: Inconsistent nullability in schema after being read in
SQL API.
Key: SPARK-16472
URL: https://issues.apache.org/jira/browse/SPARK-16472
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
It seems the data sources implementing {{FileFormat}} load the data by
forcing the fields to be nullable. It seems this was officially documented in
SPARK-11360 and was discussed here:
https://www.mail-archive.com/[email protected]/msg39230.html
However, I realised that several APIs do not follow this. For example,
{code}
DataFrameReader.json(jsonRDD: RDD[String])
{code}
So, the code below:
{code}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
val df = spark.read.schema(schema).json(rdd)
df.printSchema()
{code}
prints the following:
{code}
root
 |-- a: integer (nullable = false)
{code}
This API keeps the schema exactly as specified. However, the schema becomes
different (the fields are forced to be nullable) when the data is loaded via
the APIs below:
{code}
spark.read.format("json").schema(...).load(path).printSchema()
{code}
and
{code}
spark.read.schema(...).load(path).printSchema()
{code}
both produce the following:
{code}
root
 |-- a: integer (nullable = true)
{code}
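For reference, a self-contained reproduction of the file-based case might look
like the following (the path below is hypothetical):
{code}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val path = "/tmp/json-nullability"
val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)

// Write the sample records as text so they can be loaded through FileFormat.
spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}")).saveAsTextFile(path)

// Although the schema explicitly sets nullable = false, this prints
// `a: integer (nullable = true)`.
spark.read.schema(schema).json(path).printSchema()
{code}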
In addition, this happens for Structured Streaming as well (even when we read
the data back as a batch after writing it with Structured Streaming), as in the
sketch below.
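A minimal sketch of the Structured Streaming case, assuming hypothetical
input/checkpoint/output paths:
{code}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)

// Read a stream of JSON files with the non-nullable schema (hypothetical path).
val streamDf = spark.readStream.schema(schema).json("/tmp/stream-input")

// Write the stream out as Parquet files.
val query = streamDf.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/stream-checkpoint")
  .start("/tmp/stream-output")
query.processAllAvailable()
query.stop()

// Reading the written data back as a batch also prints `nullable = true`.
spark.read.parquet("/tmp/stream-output").printSchema()
{code}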
While testing, I wrote some test code and patches. Please see the following
PR for more cases.