[
https://issues.apache.org/jira/browse/SPARK-46959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867463#comment-17867463
]
Álvaro Marques Macêdo edited comment on SPARK-46959 at 7/20/24 2:07 AM:
------------------------------------------------------------------------
Hey, I looked into this error and I believe it is a bug, but not one caused
by the logic within Spark. From what I found, the parser used in the backend
{code:java}
com.univocity.parsers.csv.CsvParser{code}
is the one that has this behavior:
{code:java}
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
/* Reproduce the settings that Spark applies in the backend when
 * `spark.read.option("escape", "")` is used, according to
 * `apache/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala`
 */
settings.getFormat.setQuoteEscape('\u0000')
val parser = new CsvParser(settings)
parser.parseLine("\"\"") // outputs `res5: Array[String] = Array(")`
{code}
I can see a workaround, but should this be tackled by the team that maintains
univocity then?
> CSV reader reads data inconsistently depending on column position
> -----------------------------------------------------------------
>
> Key: SPARK-46959
> URL: https://issues.apache.org/jira/browse/SPARK-46959
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.1
> Reporter: Martin Rueckl
> Priority: Critical
>
> Reading the following CSV
> {code:java}
> "a";"b";"c";"d"
> 10;100,00;"Some;String";"ok"
> 20;200,00;"";"still ok"
> 30;300,00;"also ok";""
> 40;400,00;"";"" {code}
> with these options
> {code:java}
> spark.read
> .option("header","true")
> .option("sep",";")
> .option("encoding","ISO-8859-1")
> .option("lineSep","\r\n")
> .option("nullValue","")
> .option("quote",'"')
> .option("escape","") {code}
> results in the following inconsistent dataframe
>
> ||a||b||c||d||
> |10|100,00|Some;String|ok|
> |20|200,00|<null>|still ok|
> |30|300,00|also ok|"|
> |40|400,00|<null>|"|
> As one can see, the quoted empty fields of the last column are not correctly
> read as null but instead contain a single double quote. It works for column c.
> If I recall correctly, this only happens when the "escape" option is set to
> an empty string. Leaving it unset (it defaults to "\") does not seem to cause
> this bug.
> I observed this on Databricks Spark runtime 13.2 (I think that is Spark 3.4.1).
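For comparison, here is a minimal sketch of the result the reporter seems to expect, parsing the CSV quoted above with Python's standard `csv` module instead of Spark. This is only a reference point under stated assumptions: the `csv` module is a different parser from univocity, the helper name `parse_sample` is made up for illustration, and mapping empty strings to `None` is my stand-in for `nullValue=""`.

```python
import csv
import io

# Sample data copied from the issue description
# (semicolon-separated, CRLF line endings as implied by lineSep).
SAMPLE = (
    '"a";"b";"c";"d"\r\n'
    '10;100,00;"Some;String";"ok"\r\n'
    '20;200,00;"";"still ok"\r\n'
    '30;300,00;"also ok";""\r\n'
    '40;400,00;"";""\r\n'
)


def parse_sample(text):
    """Parse the sample and turn empty strings into None.

    Hypothetical helper: the None mapping only imitates nullValue="",
    it is not how Spark implements that option.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=";", quotechar='"')
    header, *rows = list(reader)
    cleaned = [[value if value != "" else None for value in row] for row in rows]
    return header, cleaned


header, rows = parse_sample(SAMPLE)
for row in rows:
    print(dict(zip(header, row)))
```

With this parser, quoted empty fields are treated the same way in columns c and d, which is the consistency the issue asks for.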