[
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584324#comment-16584324
]
koert kuipers edited comment on SPARK-17916 at 8/17/18 8:05 PM:
----------------------------------------------------------------
the default behavior in 2.3.x for the csv format is that when i write out a
null value, it comes back in as null. when i write out an empty string, it also
comes back in as null.
now my nulls are coming back in as empty strings, which would be a very big
behavior change. please advise what settings i need to get the 2.3 behavior
back, so that empty strings are read back in as nulls.
to give some background, most csv files have empty values. we have hundreds of
spark scripts/programs that read existing csv files and assume empty values are
read in as null values, and these programs act/analyze accordingly. i don't
think we are alone in this respect. for all these programs this would be a big
breaking change, unless i am missing something.
> CSV data source treats empty string as null no matter what nullValue option is
> ------------------------------------------------------------------------------
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: Hossein Falaki
> Assignee: Maxim Gekk
> Priority: Major
> Fix For: 2.4.0
>
>
> When a user configures {{nullValue}} in the CSV data source, all empty string
> values are converted to null in addition to the configured value.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> Reading this data yields a null in col2 for both rows.
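
For illustration only, the reported behavior can be modeled in plain Python without Spark. This is a minimal sketch, not Spark's implementation: the helper names `read_rows`, `parse_reported` and `parse_expected` are hypothetical, and Python's standard `csv` module stands in for Spark's CSV parser. It parses the sample data from the description and applies two rules: the reported one, where both the configured `nullValue` and the empty string become null, and the one the issue title appears to expect, where only the configured `nullValue` does.

```python
import csv
import io

# Sample data from the issue description.
DATA = 'col1,col2\n1,"-"\n2,""\n'


def read_rows(text):
    """Parse CSV text into a list of rows, skipping the header line."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows[1:]


def parse_reported(text, null_value):
    """Model of the reported behavior: null_value AND '' both become None."""
    return [
        [None if field in (null_value, "") else field for field in row]
        for row in read_rows(text)
    ]


def parse_expected(text, null_value):
    """Model of the behavior the issue seems to expect: only null_value becomes None."""
    return [
        [None if field == null_value else field for field in row]
        for row in read_rows(text)
    ]


reported = parse_reported(DATA, "-")
expected = parse_expected(DATA, "-")
print(reported)  # [['1', None], ['2', None]]
print(expected)  # [['1', None], ['2', '']]
```

The two results differ only in the second row, which is the case the comment above is concerned with: whether an empty value in existing CSV files is read back as null or as an empty string.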