[jira] [Commented] (SPARK-40982) When the value of quote or escape exists in the content of csv file, the character in the csv file will be misidentified

Apache Spark (Jira) Mon, 31 Oct 2022 22:11:07 -0700


    [ https://issues.apache.org/jira/browse/SPARK-40982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626963#comment-17626963 ]


Apache Spark commented on SPARK-40982:
--------------------------------------

User 'clairezhuang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38446

> When the value of quote or escape exists in the content of csv file, the 
> character in the csv file will be misidentified
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40982
>                 URL: https://issues.apache.org/jira/browse/SPARK-40982
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: clairezhuang
>            Priority: Minor
>
> We found that when the quote or escape character appears in the content of a
> CSV file, that character is misidentified when the file is parsed.
> For example, when this content is read by an Azure Data Factory copy activity
> and written to CSV, the file contains:
> "test\\" =
> test"
> We read the CSV as follows:
> df = spark.read.csv(
>     path='test.csv',
>     sep=',',
>     header=True,
>     quote='"',
>     escape='\\',  # one backslash; a bare '\' is a Python syntax error
>     multiLine=True,
>     lineSep='\n',
> )
> The result is *test\" =* followed by *test* on the next line, but what we
> want is *test\\" = test*.
> When the above is read by Spark:
>  1. The first \ is interpreted as escaping the second \ (so the content
> looks like a single literal \).
>  2. The " now appears to be an unescaped quote character, so we're back in the
> situation where Spark tries to handle this using STOP_AT_DELIMITER.
> As before, the rest of the CSV after this point is being parsed incorrectly.
> We could change the quote and escape options to avoid this in the scenario
> above, but the CSV files are very large and may contain any character. The
> affected data sources are systems outside of our control, so we have no means
> of controlling what content or characters will be there. If we change the
> quote or escape character, it may conflict with the content again and cause
> the same problem later in the file.
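
The two parsing steps described in the issue can be sketched in plain Python. This is a hand-written toy scanner for illustration only, not Spark's actual CSV parser; the function name `scan_quoted_field` and its parameters are hypothetical and exist only for this example. It assumes the file content is exactly the sample quoted in the description.

```python
def scan_quoted_field(text, quote='"', escape="\\"):
    """Decode one quoted field from the start of `text`.

    Returns (decoded value, text remaining after the closing quote).
    Hypothetical helper for illustration; not part of Spark.
    """
    if not text.startswith(quote):
        raise ValueError("field must start with the quote character")
    value = []
    i = 1  # skip the opening quote
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # Step 1: the escape character consumes the next character,
            # so the pair \\ collapses to one literal backslash.
            value.append(text[i + 1])
            i += 2
        elif ch == quote:
            # Step 2: this quote was not escaped, so the field ends here.
            return "".join(value), text[i + 1:]
        else:
            value.append(ch)
            i += 1
    raise ValueError("unterminated quoted field")


# File content from the report: "test\\" =<newline>test"
raw = '"test\\\\" =\ntest"'
value, rest = scan_quoted_field(raw)
```

Under these assumptions the field closes right after the literal backslash, leaving ` =` and the following line outside the quotes, which is consistent with the rest of the CSV being parsed incorrectly from that point on.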



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
