[ https://issues.apache.org/jira/browse/SPARK-40982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626963#comment-17626963 ]
Apache Spark commented on SPARK-40982:
--------------------------------------
User 'clairezhuang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38446
> When the value of quote or escape exists in the content of csv file, the
> character in the csv file will be misidentified
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-40982
> URL: https://issues.apache.org/jira/browse/SPARK-40982
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: clairezhuang
> Priority: Minor
>
> We found that when the quote or escape character also appears in the
> content of a CSV file, Spark misidentifies those characters when reading
> the file.
> For example, when such content is read by an Azure Data Factory copy
> activity and written to CSV, the content is:
> "test\\" =
> test"
> We read the CSV as below:
> df = spark.read.csv(path='test.csv'
> , sep=','
> , header=True
> , quote='"'
> , escape='\\'  # a single backslash; it must be doubled inside a Python string literal
> , multiLine=True
> , lineSep='\n'
> )
> This results in the following being written to the CSV: *test\" =*, with
> *test* on the next line, but what we want is *test\\" = test*.
> When the above is read by Spark:
> 1. The first \ is interpreted as escaping the second \ (so the content
> looks like a single literal \).
> 2. The " now appears to be an unescaped quote character, so we're back in
> the situation where Spark tries to handle this using STOP_AT_DELIMITER.
> As before, the rest of the CSV after this point is parsed incorrectly.
> We could change the quote and escape settings to avoid the scenario above,
> but the CSV files are very large and may contain any character. The data
> sources affected by this issue are systems outside of our control, so we
> have no means of controlling what content or characters will be there. If
> we change the quote or escape character, it may conflict with the content
> again, and the same issue would still occur in the content that follows.
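
The two parsing steps described in the quoted issue can be illustrated with a small Python sketch. This is a hypothetical, simplified scanner (the helper `scan_quoted_field` is invented for illustration), not Spark's actual CSV parser. It assumes the file content is exactly the sample shown in the issue, with `"` as the quote character and `\` as the escape character.

```python
def scan_quoted_field(text, quote='"', escape='\\'):
    """Scan one quoted field that starts at text[0].

    Returns (value, index_just_after_closing_quote).
    Simplified illustration only; real CSV parsers handle many more cases.
    """
    if not text.startswith(quote):
        raise ValueError("field must start with the quote character")
    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # Escape character: the next character is taken literally.
            out.append(text[i + 1])
            i += 2
        elif ch == quote:
            # Unescaped quote: the quoted field is considered closed here.
            return "".join(out), i + 1
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated quoted field")


# File content assumed from the issue: "test\\" =<newline>test"
raw = '"test\\\\" =\ntest"'

value, end = scan_quoted_field(raw)
rest = raw[end:]

print(repr(value))  # the field as seen by the scanner
print(repr(rest))   # text left over after the premature closing quote
```

In this sketch the first backslash consumes the second, so the following `"` is treated as the closing quote. `value` ends up as `test\`, and `rest` holds the remaining text (` =`, the newline, and `test"`), which a real parser then has to treat as unescaped content.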
--
This message was sent by Atlassian Jira
(v8.20.10#820010)