Suresh Thalamati created SPARK-15125:
----------------------------------------
Summary: CSV data source recognizes empty quoted strings in the
input as null.
Key: SPARK-15125
URL: https://issues.apache.org/jira/browse/SPARK-15125
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Reporter: Suresh Thalamati
The CSV data source does not differentiate between empty quoted strings and
empty fields: it reads both as null. In some scenarios users want to
distinguish these values, especially in SQL, where NULL and the empty string
have different meanings. If the input data is a dump from a traditional
relational data source, users will see different results for their SQL queries.
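To illustrate the distinction, here is a minimal sketch of a line parser that remembers whether a field was quoted. It is not Spark's CSV parser and not the planned fix; `CsvNullSketch` and `parseLine` are hypothetical names, and the parser is simplified (comma delimiter, doubled quotes as escapes, no multi-line records). An unquoted empty field becomes `None` (NULL) and a quoted empty field becomes `Some("")`.

```scala
object CsvNullSketch {
  /** Splits one CSV line into fields, keeping the quoted/unquoted distinction.
    * Unquoted empty field -> None (NULL); quoted empty field "" -> Some("").
    * Hypothetical helper for illustration only.
    */
  def parseLine(line: String): Vector[Option[String]] = {
    val fields = Vector.newBuilder[Option[String]]
    val buf = new StringBuilder
    var inQuotes = false   // currently inside a quoted field
    var wasQuoted = false  // the current field contained an opening quote
    var i = 0

    // Close the current field and reset per-field state.
    def endField(): Unit = {
      val text = buf.toString
      fields += (if (text.isEmpty && !wasQuoted) None else Some(text))
      buf.clear()
      wasQuoted = false
    }

    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"') {
          // A doubled quote inside a quoted field is a literal quote.
          if (i + 1 < line.length && line.charAt(i + 1) == '"') {
            buf.append('"')
            i += 1
          } else {
            inQuotes = false
          }
        } else {
          buf.append(c)
        }
      } else {
        c match {
          case '"'   => inQuotes = true; wasQuoted = true
          case ','   => endField()
          case other => buf.append(other)
        }
      }
      i += 1
    }
    endField() // the last field has no trailing delimiter
    fields.result()
  }
}
```

Under these assumptions, parsing the third data row of the test file below, `2015,Porsche,"",,`, yields `Some("")` for model and `None` for comment and price, which matches the expected output in the repro.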
{code}
Repro:
Test Data: (test.csv)
year,make,model,comment,price
2017,Tesla,Mode 3,looks nice.,35000.99
2016,Chevy,Bolt,"",29000.00
2015,Porsche,"",,
scala> val df = sqlContext.read.format("csv").
     |   option("header", "true").
     |   option("inferSchema", "true").
     |   option("nullValue", null).
     |   load("/tmp/test.csv")
df: org.apache.spark.sql.DataFrame = [year: int, make: string ... 3 more fields]
scala> df.show
+----+-------+------+-----------+--------+
|year|   make| model|    comment|   price|
+----+-------+------+-----------+--------+
|2017|  Tesla|Mode 3|looks nice.|35000.99|
|2016|  Chevy|  Bolt|       null| 29000.0|
|2015|Porsche|  null|       null|    null|
+----+-------+------+-----------+--------+
Expected:
+----+-------+------+-----------+--------+
|year|   make| model|    comment|   price|
+----+-------+------+-----------+--------+
|2017|  Tesla|Mode 3|looks nice.|35000.99|
|2016|  Chevy|  Bolt|           | 29000.0|
|2015|Porsche|      |       null|    null|
+----+-------+------+-----------+--------+
{code}
I am testing a fix for this issue and will try to submit a PR for it soon.