Guo Wei created SPARK-37604:
-------------------------------

             Summary: The parameter emptyValueInRead in CSVOptions is not designed as supposed to be
                 Key: SPARK-37604
                 URL: https://issues.apache.org/jira/browse/SPARK-37604
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Guo Wei

For null values, the parameter nullValue can be set for both reading and writing in CSVOptions:
{code:scala}
// For writing, convert: null(dataframe) => nullValue(csv)
writerSettings.setNullValue(nullValue)
// For reading, convert: nullValue or ,,(csv) => null(dataframe)
settings.setNullValue(nullValue)
{code}
For example, suppose a column contains null values and nullValue is set to the string "NULL" when writing:
{code:scala}
Seq(("Tesla", null.asInstanceOf[String])).toDF("make", "comment").write.option("nullValue", "NULL").csv(path)
{code}
The saved CSV file looks like this:
{noformat}
Tesla,NULL
{noformat}
If we then read this CSV file with nullValue set to the string "NULL":
{code:java}
spark.read.option("nullValue", "NULL").csv(path)
{code}
we get a DataFrame with the following data:
||make||comment||
|Tesla|null|

{color:#57d9a3}*We can successfully recover the original DataFrame.*{color}

Since Spark 2.4, for empty strings, there are two parameters that can be set in CSVOptions: emptyValueInRead for reading and emptyValueInWrite for writing:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)
writerSettings.setEmptyValue(emptyValueInWrite)
// For reading, convert: ""(csv) => emptyValueInRead(dataframe)
settings.setEmptyValue(emptyValueInRead)
{code}
I think the write handling is suitable, but the read handling is supposed to be as follows:
{code:scala}
// in asParserSettings: "" or emptyValueInWrite(csv) => ""(dataframe)
settings.setEmptyValue(emptyValueInRead)
{code}
For example, suppose a column contains empty strings and emptyValueInWrite is set to the string "EMPTY":
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path)
{code}
The saved CSV file looks like this:
{noformat}
Tesla,EMPTY
{noformat}
If we then read this CSV file with emptyValueInRead set to the string "EMPTY":
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path)
{code}
we get a DataFrame with the following data:
||make||comment||
|Tesla|EMPTY|

but the expected DataFrame should contain:
||make||comment||
|Tesla| |

{color:#de350b}*We cannot recover the original DataFrame.*{color}
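
To make the difference between the reported and the expected read behavior concrete, here is a minimal sketch in plain Scala, without Spark. It only models the mappings written in the code comments above. The object and function names are hypothetical, and an empty field in the file is simplified to an empty string; this is not how CSVOptions or the underlying parser is actually implemented.

```scala
// Simplified model of the empty-string round trip described in this issue.
// All names are hypothetical; this is not Spark's or the CSV parser's code.
object EmptyValueRoundTrip {
  // Write side: ""(dataframe) => emptyValue(csv)
  def writeField(value: String, emptyValue: String): String =
    if (value.isEmpty) emptyValue else value

  // Read side as reported: ""(csv) => emptyValue(dataframe).
  // A field that already holds the emptyValue token passes through unchanged.
  def readFieldReported(field: String, emptyValue: String): String =
    if (field.isEmpty) emptyValue else field

  // Read side as proposed: "" or emptyValue(csv) => ""(dataframe)
  def readFieldProposed(field: String, emptyValue: String): String =
    if (field.isEmpty || field == emptyValue) "" else field

  def main(args: Array[String]): Unit = {
    val written = writeField("", "EMPTY")
    println(s"csv field:     '$written'")
    println(s"reported read: '${readFieldReported(written, "EMPTY")}'")
    println(s"proposed read: '${readFieldProposed(written, "EMPTY")}'")
  }
}
```

Under this model, only the proposed read mapping returns the empty string that was originally written, which is the round trip that nullValue already provides for nulls.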