GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/19707
[SPARK-22472][SQL] add null check for top-level primitive values
## What changes were proposed in this pull request?
One powerful feature of `Dataset` is that we can easily map SQL rows to
Scala/Java objects, with a runtime null check performed automatically.

For example, say we have a parquet file with schema `<a: int, b: string>` and a
`case class Data(a: Int, b: String)`. Users can easily read this parquet file
into `Data` objects, and Spark will throw an NPE if column `a` contains null
values.
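A minimal sketch of that usage (the parquet path below is hypothetical):
```
import spark.implicits._

case class Data(a: Int, b: String)

// Hypothetical path; the file is assumed to have schema <a: int, b: string>.
val ds = spark.read.parquet("/path/to/data.parquet").as[Data]

// Materializing the objects throws an NPE if column `a` contains a null,
// since a Scala Int cannot represent null.
ds.collect()
```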
However, this null check is missing for top-level primitive values. For
example, say we have a parquet file with schema `<a: int>` and we read it as a
Scala `Int`. If column `a` contains null values, we get surprising results:
```
scala> val ds = spark.read.parquet(...).as[Int]

scala> ds.show()
+----+
|   v|
+----+
|null|
|   1|
+----+

scala> ds.collect
res0: Array[Long] = Array(0, 1)

scala> ds.map(_ * 2).show
+-----+
|value|
+-----+
|   -2|
|    2|
+-----+
```
This is because Spark internally uses special default values for primitive
types, but it never expects users to see or operate on these default values
directly.
This PR adds a null check for top-level primitive values.
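For reference, a minimal in-memory sketch of the same problem (column name and
data chosen to mirror the example above): before this change the null is
silently replaced by a primitive default, while reading the column as
`Option[Int]` keeps it visible as `None`.
```
import spark.implicits._

// Hypothetical reproduction of the example above, without a parquet file:
// a single nullable int column named "v".
val df = Seq[Integer](null, 1).toDF("v")

// Reading it as a bare Int hides the null behind a primitive default value;
// with this change, such a null fails the top-level null check instead.
df.as[Int].collect()

// Reading it as Option[Int] keeps the null visible as None.
df.as[Option[Int]].collect()   // Array(None, Some(1))
```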
## How was this patch tested?
A new test was added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark bug
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19707.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19707
----
commit dad50806b27a40ed1112d8ee29b3bd5c60164170
Author: Wenchen Fan <[email protected]>
Date: 2017-11-09T13:39:10Z
add null check for top-level primitive values
----