[
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584378#comment-16584378
]
koert kuipers edited comment on SPARK-17916 at 8/18/18 7:20 PM:
----------------------------------------------------------------
my first observation is that if i do this:
{code:scala}
// run in spark-shell; elsewhere toDF needs `import spark.implicits._`
val litNull: String = null
val df = Seq(
  (1, "John Doe"),
  (2, ""),        // empty string
  (3, "-"),
  (4, litNull)    // null
).toDF("id", "name")
df
  .write
  .csv("/tmp/abc1")
{code}
and inspect in bash
{code:bash}
cat /tmp/abc1/part-0000*.csv
1,John Doe
2,""
3,-
4,""
{code}
notice how for the null value (id 4) it wrote the quoted empty string, the
same as for the actual empty string (id 2). that is emptyValue, not
nullValue, which seems incorrect to me.
if i do the same exercise in spark 2.3 i get:
{code:bash}
cat /tmp/abc1/part-0000*.csv
1,John Doe
2,
3,-
4,
{code}
so my actual csv data has changed upon writing. that makes me nervous about
compatibility with other systems that read data we produce.
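To make the distinction concrete, here is a minimal sketch in plain Scala (no Spark dependency) of how a writer could keep the two cases apart. It is a toy model under stated assumptions, not Spark's actual writer: `renderField`, `renderRow` and `CsvFieldSketch` are hypothetical names, and `nullValue` / `emptyValue` only borrow the option names mentioned above.

```scala
// toy model only: this is not Spark's writer code.
// renderField and renderRow are hypothetical helpers; nullValue and emptyValue
// just borrow the option names discussed above. quoting and escaping of
// ordinary values is deliberately not modeled.
object CsvFieldSketch {
  // null maps to nullValue, "" maps to emptyValue, anything else passes through
  def renderField(value: String, nullValue: String, emptyValue: String): String =
    if (value == null) nullValue
    else if (value.isEmpty) emptyValue
    else value

  def renderRow(id: Int, name: String, nullValue: String, emptyValue: String): String =
    s"$id,${renderField(name, nullValue, emptyValue)}"

  // same rows as the DataFrame in the comment above
  val rows: Seq[(Int, String)] = Seq((1, "John Doe"), (2, ""), (3, "-"), (4, null))

  // null and empty string kept distinct: null -> nothing, "" -> quoted empty string
  val separated: Seq[String] =
    rows.map { case (id, name) => renderRow(id, name, nullValue = "", emptyValue = "\"\"") }

  // both rendered as nothing, matching the spark 2.3 output shown above
  val legacy: Seq[String] =
    rows.map { case (id, name) => renderRow(id, name, nullValue = "", emptyValue = "") }
}
```

In this model `CsvFieldSketch.separated` yields `2,""` for the empty string and `4,` for the null, while `CsvFieldSketch.legacy` yields `2,` and `4,` as in the spark 2.3 output. The output reported in the comment (`4,""` for a null) corresponds to neither.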
> CSV data source treats empty string as null no matter what nullValue option is
> ------------------------------------------------------------------------------
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: Hossein Falaki
> Assignee: Maxim Gekk
> Priority: Major
> Fix For: 2.4.0
>
>
> When a user configures {{nullValue}} in the CSV data source, all empty
> string values are also converted to null, in addition to the configured value.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.