GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/14124
[SPARK-16472][SQL] Inconsistent nullability in schema after being read in SQL API
## What changes were proposed in this pull request?
The data sources implementing `FileFormat` seem to load data by forcing
all fields to be nullable. (See here,
https://github.com/apache/spark/blob/7374e518e2641fddfe57003340db410224b37581/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L402-L409)
This appears to have been officially documented in SPARK-11360 and was discussed here
https://www.mail-archive.com/[email protected]/msg39230.html
However, I realised that several APIs do not follow this. For example,
```scala
DataFrameReader.json(jsonRDD: RDD[String])
```
So, the code below:
```scala
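import org.apache.spark.sql.types._

// Two JSON records; the second one has a null value for "a".
val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
// User-specified schema that declares "a" as non-nullable.
val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)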
val df = spark.read.schema(schema).json(rdd)
df.printSchema()
```
prints the following:
```
root
|-- a: integer (nullable = false)
```
This API keeps the user-specified schema as it is after loading. However, the
schema differs (all fields become nullable) when the data is loaded through the
APIs below:
```
spark.read.format("json").schema(...).load(path).printSchema()
```
```
spark.read.schema(...).load(path).printSchema()
```
Both produce the following:
```
root
|-- a: integer (nullable = true)
```
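The forced nullability described above can be pictured with a small sketch. The helper `forceNullable` below is hypothetical (it is not part of Spark's public API and is not the code in `DataSource.scala`); it only illustrates the effect of marking every field of a user-specified schema as nullable, which is what the file-based read path appears to do. It assumes only `org.apache.spark.sql.types` is on the classpath, for example when pasted into `spark-shell`.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper for illustration only: returns a copy of the given type
// with every struct field, array element, and map value marked as nullable.
def forceNullable(dt: DataType): DataType = dt match {
  case StructType(fields) =>
    StructType(fields.map { f =>
      f.copy(dataType = forceNullable(f.dataType), nullable = true)
    })
  case ArrayType(elementType, _) =>
    ArrayType(forceNullable(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(forceNullable(keyType), forceNullable(valueType), valueContainsNull = true)
  case other => other
}

// The same schema as in the example above: "a" is declared non-nullable.
val userSchema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
val loadedSchema = forceNullable(userSchema).asInstanceOf[StructType]

println(userSchema("a").nullable)   // false
println(loadedSchema("a").nullable) // true
```

Comparing `userSchema` with the schema reported by `printSchema()` in this way is one possible means of checking whether a given read API preserved the declared nullability.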
In addition, this also happens with structured streaming (even when we read
the data in batch mode after writing it with structured streaming).
I added more such cases to the tests, including ones for structured streaming.
Please refer to the test code below.
## How was this patch tested?
Unit tests in `JsonSuite`, `FileStreamSinkSuite`, `FileStreamSourceSuite`,
`HadoopFsRelationTest`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-16472
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14124.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14124
----
commit a917678886779f236b1feffa23a11529ce67e97c
Author: hyukjinkwon <[email protected]>
Date: 2016-07-10T08:12:27Z
Inconsistent nullability in schema after being read in SQL API
----