GitHub user maropu opened a pull request:
https://github.com/apache/spark/pull/21631
[SPARK-24645][SQL] Skip parsing when csvColumnPruning enabled and partitions scanned only
## What changes were proposed in this pull request?
On master, when `csvColumnPruning` is enabled and only partition columns are scanned,
the query below throws an exception:
```
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:12:51 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5)
java.lang.NullPointerException
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:197)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:190)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
	at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
	...
```
This PR modifies the code to skip CSV parsing in this case.
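The idea can be sketched as follows. This is a simplified, hypothetical illustration, not the actual Spark patch: the names (`Field`, `parseIterator`, `requiredSchema`, `parse`) are illustrative. When the required non-partition schema is empty, i.e. the query selects only partition columns, the reader can emit one empty row per record instead of invoking the CSV parser at all, which avoids the NPE and the parsing cost.

```scala
// Illustrative sketch only; names and structure do not match Spark internals.
case class Field(name: String)

def parseIterator(
    lines: Iterator[String],
    requiredSchema: Seq[Field],
    parse: String => Seq[Any]): Iterator[Seq[Any]] = {
  if (requiredSchema.isEmpty) {
    // Only partition columns are requested: nothing to parse from the file,
    // but one empty row per record keeps counts and aggregates correct.
    lines.map(_ => Seq.empty[Any])
  } else {
    lines.map(parse)
  }
}
```

Under this sketch, `spark.read.csv(dir).selectExpr("sum(p)")` would take the empty-schema branch, since `p` is a partition column and no file columns are required.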
## How was this patch tested?
Added tests in `CSVSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/maropu/spark SPARK-24645
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21631.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21631
----
commit 59a7f142ae9c83c76c2bfbff2962c071fc586122
Author: Takeshi Yamamuro <yamamuro@...>
Date: 2018-06-25T04:18:37Z
fix
----