[
https://issues.apache.org/jira/browse/SPARK-38167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813741#comment-17813741
]
Nicholas Chammas commented on SPARK-38167:
------------------------------------------
[~marnixvandenbroek] - Could you link to the bug report you filed with
Univocity?
cc [~maxgekk] - I believe you have hit some parsing bugs in Univocity recently.
> CSV parsing error when using escape='"'
> ----------------------------------------
>
> Key: SPARK-38167
> URL: https://issues.apache.org/jira/browse/SPARK-38167
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 3.2.1
> Environment: Pyspark on a single-node Databricks managed Spark 3.1.2
> cluster.
> Reporter: Marnix van den Broek
> Priority: Major
> Labels: correctness, csv, csvparser, data-integrity
>
> Hi all,
> When reading CSV files with Spark, I ran into a parsing bug.
> {*}The summary{*}:
> When
> # reading a comma-separated, double-quoted CSV file using the csv
> reader options _escape='"'_ and {_}header=True{_},
> # with a row containing a quoted empty field,
> # followed by a quoted field that starts with a comma and continues with one
> or more characters,
> selecting columns from the dataframe at or after the field described in 3)
> gives incorrect and inconsistent results.
> {*}In detail{*}:
> When I instruct Spark to read this CSV file:
>
> {code:java}
> col1,col2
> "",",a"
> {code}
>
> using the CSV reader options escape='"' (unnecessary for the example,
> necessary for the files I'm processing) and header=True, I expect the
> following result:
>
> {code:java}
> spark.read.csv(path, escape='"', header=True).show()
>
> +----+----+
> |col1|col2|
> +----+----+
> |null| ,a|
> +----+----+
> {code}
>
> Spark does yield this result, so far so good. However, when I select col2
> from the dataframe, Spark yields an incorrect result:
>
> {code:java}
> spark.read.csv(path, escape='"', header=True).select('col2').show()
>
> +----+
> |col2|
> +----+
> | a"|
> +----+
> {code}
>
> If you run this example with more columns in the file and more commas in the
> field, e.g. ",,,,,,,a", the problem compounds: Spark shifts many values to
> the right, causing unexpected and incorrect results. The inconsistency
> between the full read and the column selection surprised me, as it implies
> that the parsing is evaluated differently in the two cases.
> I suspect the bug is located in the quote-balancing and un-escaping methods
> of the csv parser, but I can't find that code in the code base. I'd be happy
> to take a look at it if anyone can point me to it.
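
For reference, the expected parse of the sample row can be cross-checked outside Spark. The following is a minimal sketch using Python's standard csv module, not Spark or Univocity; it assumes that doublequote=True is a fair stand-in for escape='"', and the helper names parse_rows and select_column are made up for illustration. Note that Python returns an empty string where Spark shows null.

```python
import csv
import io

# The two-line sample file from the report.
SAMPLE = 'col1,col2\n"",",a"\n'


def parse_rows(text):
    """Parse CSV text, treating a doubled quote inside a quoted field as a literal quote.

    Assumption: doublequote=True approximates Spark's escape='"' option.
    """
    reader = csv.reader(
        io.StringIO(text), delimiter=",", quotechar='"', doublequote=True
    )
    return list(reader)


def select_column(text, name):
    """Return the values of one named column (hypothetical helper, not a Spark API)."""
    header, *rows = parse_rows(text)
    idx = header.index(name)
    return [row[idx] for row in rows]


print(parse_rows(SAMPLE))             # header row, then ['', ',a']
print(select_column(SAMPLE, "col2"))  # [',a']
```

Both the full parse and the single-column selection give ",a" for col2 here, which is the consistency the report expects from Spark.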