[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:
----------------------------
    Description: 
In the CSV data source, for null values, the nullValue option can be set for reading and writing in 
CSVOptions:
{code:scala}
// For writing, convert: null(dataframe) => nullValue(csv)

// For reading, convert: nullValue or ,,(csv) => null(dataframe)
{code}
For example, given a column that contains null values, if nullValue is set to the string "NULL":
{code:scala}
Seq(("Tesla", null.asInstanceOf[String])).toDF("make", 
"comment").write.option("nullValue", "NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
and if we read this csv file with nullValue set to the string "NULL":
{code:scala}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
we get a DataFrame whose data matches the original:
||make||comment||
|Tesla|null|

{color:#57d9a3}*We can successfully recover the original DataFrame.*{color}
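Putting the two steps together, a minimal round-trip sketch (assuming a running SparkSession named spark with its implicits imported; the path and the header option are illustrative additions) looks like:
{code:scala}
import spark.implicits._

// Illustrative scratch path; adjust as needed.
val path = "/tmp/csv-null-roundtrip"

// Write: null in the DataFrame is emitted as the nullValue token "NULL".
Seq(("Tesla", null.asInstanceOf[String])).toDF("make", "comment")
  .write.option("header", true).option("nullValue", "NULL").csv(path)

// Read: the "NULL" token is mapped back to null, recovering the original data.
spark.read.option("header", true).option("nullValue", "NULL").csv(path).show()
{code}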

 

Since Spark 2.4, for empty strings, CSVOptions provides emptyValueInRead for reading and 
emptyValueInWrite for writing:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)

// For reading, convert: "" (csv) => emptyValueInRead(dataframe){code}
I think the read handling is not suitable: "" or emptyValueInWrite values in the csv are not 
converted back to "" (real empty strings); instead we actually get emptyValueInRead's setting 
value. It should be as follows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code}
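To make the proposal concrete, here is a hypothetical helper (not Spark's actual implementation) that expresses the proposed read-side rule for a single CSV token:
{code:scala}
// Hypothetical sketch of the proposed read-side mapping, not Spark's code:
// a token that is already empty, or that equals emptyValueInRead, should be
// read back as a real empty string "".
def readEmptyAware(token: String, emptyValueInRead: String): String =
  if (token.isEmpty || token == emptyValueInRead) "" else token

readEmptyAware("EMPTY", "EMPTY") // => "" (recovered empty string)
readEmptyAware("Tesla", "EMPTY") // => "Tesla" (ordinary value untouched)
{code}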
For example, given a column that contains empty strings, if emptyValueInWrite is set to the 
string "EMPTY":
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
"EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY
{noformat}
and if we read this csv file with emptyValue (emptyValueInRead) set to the string "EMPTY":
{code:scala}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
we actually get a DataFrame whose data is:
||make||comment||
|Tesla|EMPTY|

but the expected DataFrame should be:
||make||comment||
|Tesla| |

{color:#de350b}*We cannot recover the original DataFrame.*{color}
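The failed round trip can be reproduced end to end with a sketch like the following (same assumptions as above: a running SparkSession named spark, an illustrative path, and the header option added so column names survive):
{code:scala}
import spark.implicits._

// Illustrative scratch path; adjust as needed.
val path = "/tmp/csv-empty-roundtrip"

// Write: "" in the DataFrame is emitted as the emptyValue token "EMPTY".
Seq(("Tesla", "")).toDF("make", "comment")
  .write.option("header", true).option("emptyValue", "EMPTY").csv(path)

// Read: the "EMPTY" token in the file is not converted back to "", so the
// comment column comes back as the literal string "EMPTY".
spark.read.option("header", true).option("emptyValue", "EMPTY").csv(path).show()
{code}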



> The parameter emptyValueInRead (in CSVOptions) should be designed so that any field 
> matching this string is set to the empty value "" when reading
> -------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37604
>                 URL: https://issues.apache.org/jira/browse/SPARK-37604
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.2.0
>            Reporter: Wei Guo
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
