Repository: spark Updated Branches: refs/heads/master 7f3c6ff4f -> 02a4386ae
[SPARK-20978][SQL] Bump up Univocity version to 2.5.4 ## What changes were proposed in this pull request? There was a bug in Univocity Parser that causes the issue in SPARK-20978. This was fixed as below: ```scala val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS()) df.show() ``` **Before** ``` java.lang.NullPointerException at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89) at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207) ... ``` **After** ``` +---+----+--------+ | a| b|unparsed| +---+----+--------+ | a|null| a| +---+----+--------+ ``` It was fixed in 2.5.0 and 2.5.4 was released. I guess it'd be safe to upgrade this. ## How was this patch tested? Unit test added in `CSVSuite.scala`. Author: hyukjinkwon <gurwls...@gmail.com> Closes #19113 from HyukjinKwon/bump-up-univocity. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/02a4386a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/02a4386a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/02a4386a Branch: refs/heads/master Commit: 02a4386aec5f83f41ca1abc5f56e223b6fae015c Parents: 7f3c6ff Author: hyukjinkwon <gurwls...@gmail.com> Authored: Tue Sep 5 23:21:43 2017 +0800 Committer: Wenchen Fan <wenc...@databricks.com> Committed: Tue Sep 5 23:21:43 2017 +0800 ---------------------------------------------------------------------- dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- sql/core/pom.xml | 2 +- .../spark/sql/execution/datasources/csv/CSVSuite.scala | 8 ++++++++ 4 files changed, 11 insertions(+), 3 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/spark/blob/02a4386a/dev/deps/spark-deps-hadoop-2.6 ---------------------------------------------------------------------- diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/deps/spark-deps-hadoop-2.6 index 1535103..e3b9ce0 100644 --- a/dev/deps/spark-deps-hadoop-2.6 +++ b/dev/deps/spark-deps-hadoop-2.6 @@ -182,7 +182,7 @@ stax-api-1.0.1.jar stream-2.7.0.jar stringtemplate-3.2.1.jar super-csv-2.2.0.jar -univocity-parsers-2.2.1.jar +univocity-parsers-2.5.4.jar validation-api-1.1.0.Final.jar xbean-asm5-shaded-4.4.jar xercesImpl-2.9.1.jar http://git-wip-us.apache.org/repos/asf/spark/blob/02a4386a/dev/deps/spark-deps-hadoop-2.7 ---------------------------------------------------------------------- diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7 index deaa288..a3f3f32 100644 --- a/dev/deps/spark-deps-hadoop-2.7 +++ b/dev/deps/spark-deps-hadoop-2.7 @@ -183,7 +183,7 @@ stax-api-1.0.1.jar stream-2.7.0.jar stringtemplate-3.2.1.jar super-csv-2.2.0.jar -univocity-parsers-2.2.1.jar +univocity-parsers-2.5.4.jar validation-api-1.1.0.Final.jar xbean-asm5-shaded-4.4.jar xercesImpl-2.9.1.jar http://git-wip-us.apache.org/repos/asf/spark/blob/02a4386a/sql/core/pom.xml ---------------------------------------------------------------------- diff --git a/sql/core/pom.xml b/sql/core/pom.xml index 9a3cacb..7ee002e 100644 --- a/sql/core/pom.xml +++ b/sql/core/pom.xml @@ -38,7 +38,7 @@ <dependency> <groupId>com.univocity</groupId> <artifactId>univocity-parsers</artifactId> - <version>2.2.1</version> + <version>2.5.4</version> <type>jar</type> </dependency> <dependency> http://git-wip-us.apache.org/repos/asf/spark/blob/02a4386a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---------------------------------------------------------------------- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index 243a55c..be89141 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala @@ -1195,4 +1195,12 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils { .csv(Seq("10u12").toDS()) checkAnswer(results, Row(null)) } + + test("SPARK-20978: Fill the malformed column when the number of tokens is less than schema") { + val df = spark.read + .schema("a string, b string, unparsed string") + .option("columnNameOfCorruptRecord", "unparsed") + .csv(Seq("a").toDS()) + checkAnswer(df, Row("a", null, "a")) + } } --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org