[jira] [Comment Edited] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

koert kuipers (JIRA) Sat, 18 Aug 2018 12:21:35 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584378#comment-16584378
 ]


koert kuipers edited comment on SPARK-17916 at 8/18/18 7:20 PM:
----------------------------------------------------------------

my first observation is that if i do this:
{code:scala}
val litNull: String = null
val df = Seq(
  (1, "John Doe"),
  (2, ""),
  (3, "-"),
  (4, litNull)
).toDF("id", "name")

df
  .write
  .csv("/tmp/abc1")
{code}
and inspect in bash
{code:bash}
cat /tmp/abc1/part-0000*.csv
1,John Doe
2,""
3,-
4,""
{code}
notice how for the null value (id 4) it wrote the quoted empty string, the same 
as for the real empty string (id 2). that is the emptyValue, not the nullValue, 
which seems incorrect to me.

if i do the same exercise in spark 2.3 i get:
{code:bash}
cat /tmp/abc1/part-0000*.csv
1,John Doe
2,
3,-
4,
{code}

so my actual csv data has changed upon writing. that makes me nervous about 
compatibility with other systems that read data we produce.
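
a minimal sketch of one way to make the written files independent of the writer defaults: pin both markers explicitly on write and use the same ones on read. the {{NULL}} marker and the {{/tmp/abc2}} path are made up for illustration, and this assumes a build that recognizes the {{emptyValue}} option in addition to {{nullValue}}. it is not a statement of how the data source is meant to be used.
{code:scala}
import org.apache.spark.sql.SparkSession

// assumption: a local session for experimenting; in spark-shell `spark` already exists
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("csv-null-vs-empty")
  .getOrCreate()
import spark.implicits._

val litNull: String = null
val df = Seq(
  (1, "John Doe"),
  (2, ""),
  (3, "-"),
  (4, litNull)
).toDF("id", "name")

// pin both markers so the output does not depend on version-specific defaults
df.write
  .mode("overwrite")
  .option("nullValue", "NULL")   // hypothetical marker for null
  .option("emptyValue", "\"\"")  // quoted empty string for the empty string
  .csv("/tmp/abc2")              // hypothetical output path

// read back with the same markers and check which rows come back as null
val back = spark.read
  .option("nullValue", "NULL")
  .option("emptyValue", "")
  .csv("/tmp/abc2")
back.show()
{code}
any other system reading these files would need to be told about the same null marker, so this only helps when the consumer can be configured.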



was (Author: koert):
my first observation is that if i do this:
{code:scala}
val litNull: String = null
val df = Seq(
  (1, "John Doe"),
  (2, ""),
  (3, "-"),
  (4, litNull)
).toDF("id", "name")

df
  .write
  .csv("/tmp/abc1")
{code}
and inspect in bash
{code:bash}
cat /tmp/abc1/part-0000*.csv
1,John Doe
2,""
3,-
4,""
{code}
notice how that line has 4,""
if i do the same exercise in spark 2.3 i get:
{code:bash}
cat /tmp/abc1/part-0000*.csv
1,John Doe
2,
3,-
4,
{code}

so my actual csv data has changed upon writing. that makes me nervous about 
compatibility with other systems that read data we produce.


> CSV data source treats empty string as null no matter what nullValue option is
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-17916
>                 URL: https://issues.apache.org/jira/browse/SPARK-17916
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: Hossein Falaki
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 2.4.0
>
>
> When a user configures {{nullValue}} in the CSV data source, all empty string 
> values are also converted to null, in addition to the configured value.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.
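>
> The steps above can be sketched as a self-contained reproduction. This is a sketch under assumptions, not part of the original report: the file path is hypothetical, the first line of the data is assumed to be a header, and a local session is created only for experimentation.
> {code:scala}
> import java.nio.file.{Files, Paths}
> import org.apache.spark.sql.SparkSession
>
> // assumption: a local session; in spark-shell `spark` already exists
> val spark = SparkSession.builder()
>   .master("local[1]")
>   .appName("spark-17916-repro")
>   .getOrCreate()
>
> // write the two-row sample from the description to a hypothetical path
> val path = "/tmp/spark-17916.csv"
> Files.write(Paths.get(path), "col1,col2\n1,\"-\"\n2,\"\"\n".getBytes("UTF-8"))
>
> val df = spark.read
>   .format("csv")
>   .option("header", "true")   // assumption: first line holds column names
>   .option("nullValue", "-")
>   .load(path)
>
> df.show()
> // number of rows whose col2 was read as null
> println(df.filter(df("col2").isNull).count())
> {code}
> According to this report, on affected versions both rows come back with a null in col2, although only the first one matches the configured {{nullValue}}.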



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

