[jira] [Comment Edited] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

koert kuipers (JIRA) Fri, 17 Aug 2018 13:06:32 -0700


    [ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584324#comment-16584324 ]


koert kuipers edited comment on SPARK-17916 at 8/17/18 8:05 PM:
----------------------------------------------------------------

the default behavior in 2.3.x for csv format is that when i write out a null 
value, it comes back in as null. when i write out an empty string, it also 
comes back in as null.

now my nulls are coming back in as empty strings, which would be a very big 
behavior change. please advise what settings i need to get the behavior of 2.3 
back, so empty strings read back in as nulls.

to give some background, most csv files have empty values. we have hundreds of 
spark scripts/programs that read existing csv files and assume empty values are 
read in as null values, and these programs act/analyze accordingly. i don't 
think we are alone in this respect. for all these programs this would be a big 
breaking change, unless i am missing something.


was (Author: koert):
the default behavior in 2.3.x for csv format is that when i write out a null 
value, it comes back in as null. when i write out an empty string, it also 
comes back in as null.

now my nulls are coming back in as empty strings, which would be a very big 
behavior change. please advise what settings i need to get the behavior of 2.3 
back, so empty strings read back in as nulls.

to give some background, most csv files have empty values. we have hundreds of 
spark scripts/programs that assume these are read in as null values and these 
programs act/analyze accordingly. i don't think we are alone in this respect. 
for all these programs this would be a big breaking change, unless i am missing 
something.

> CSV data source treats empty string as null no matter what nullValue option is
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-17916
>                 URL: https://issues.apache.org/jira/browse/SPARK-17916
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: Hossein Falaki
>            Assignee: Maxim Gekk
>            Priority: Major
>             Fix For: 2.4.0
>
>
> When a user configures {{nullValue}} in the CSV data source, all empty string 
> values are also converted to null, in addition to values that match 
> {{nullValue}}.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> Both rows end up with a null in col2.
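
To make the difference concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use Spark's CSV parser; it only models the two conversion rules on the sample data from the description. The helper `read_rows` and its `empty_is_null` flag are hypothetical names introduced for illustration.

```python
import csv
import io

# Sample data copied from the issue description.
DATA = 'col1,col2\n1,"-"\n2,""\n'


def read_rows(text, null_value, empty_is_null):
    """Parse CSV text and map cells to None (hypothetical helper).

    empty_is_null=True models the behavior reported in this issue:
    the configured null value AND the empty string both become None.
    empty_is_null=False models what the reporter expected:
    only the configured null value becomes None.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for record in reader:
        converted = []
        for cell in record:
            if cell == null_value or (empty_is_null and cell == ""):
                converted.append(None)
            else:
                converted.append(cell)
        rows.append(dict(zip(header, converted)))
    return rows


reported = read_rows(DATA, null_value="-", empty_is_null=True)
expected = read_rows(DATA, null_value="-", empty_is_null=False)

for label, rows in (("reported", reported), ("expected", expected)):
    print(label, rows)
```

Under the reported rule, col2 is None in both rows; under the expected rule, only the row containing "-" gets None and the second row keeps its empty string.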



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
