clairezhuang opened a new pull request, #38446:
URL: https://github.com/apache/spark/pull/38446

   When the quote or escape character appears in the content of a CSV file, 
characters in the file are misidentified.
   When this content is read by an Azure Data Factory copy activity and 
written to CSV, the content is:
   "test\\\\" =
   test"
   We read the CSV as below:
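   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   # Read the CSV with explicit quote and escape characters.
   df = spark.read.csv(
       path='test.csv',
       sep=',',
       header=True,
       quote='"',
       escape='\\',
       multiLine=True,
       lineSep='\n',
   )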
   This results in the following being written to the CSV: **test\\" =** and 
**test** on the next line, but what we want is **test\\\\" = test**.
   Now, when the above is read by Spark: 
   1. The first \ is interpreted as escaping the second \ (so the content 
looks like a single literal \). 
   2. The " now appears to be an unescaped quote character, so we're back in 
the situation where Spark tries to handle this using STOP_AT_DELIMITER. 
   As before, the rest of the CSV after this point is parsed incorrectly. 
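   The two steps above can be illustrated with a minimal sketch of a quote/escape scanner. This is not Spark's actual parser; `scan_quoted_field` is a hypothetical helper that assumes the escape character makes the next character literal and that the first unescaped quote ends the quoted value. The input is assumed to be literally the content shown earlier in this description.

```python
def scan_quoted_field(text, quote='"', escape='\\'):
    """Scan one quoted field that starts at text[0].

    Returns (value, rest): the unescaped value and whatever input is left
    after the first unescaped quote. Hypothetical helper, illustration only.
    """
    if not text.startswith(quote):
        raise ValueError("field must start with the quote character")
    value = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # Step 1: the escape character makes the next character literal.
            value.append(text[i + 1])
            i += 2
        elif ch == quote:
            # Step 2: an unescaped quote is treated as the end of the value.
            return "".join(value), text[i + 1:]
        else:
            value.append(ch)
            i += 1
    raise ValueError("unterminated quoted field")


# Assumed raw content: four backslashes before the inner quote, then a newline.
raw = r'"test\\\\" =' + '\n' + 'test"'
value, rest = scan_quoted_field(raw)
print(repr(value))
print(repr(rest))
```

   In this sketch the backslashes pair up into literal backslashes, so the following " is seen as unescaped and the remainder of the field is left over. That leftover input is where the unescaped-quote handling from step 2 takes over.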
   We could change the quote and escape options to avoid this in the scenario 
above, but the CSV files are very large and may contain any character. 
The data sources affected by this issue are systems outside of our control, so 
we have no means of controlling what content or characters will be there. If we 
change the quote or escape character, it may conflict with the content again 
and cause the same issue later in the file. 
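   As a rough sketch of why switching the quote or escape character only moves the problem, the hypothetical helper below counts how often each candidate character occurs in a sample of content. Any candidate with a nonzero count can collide again. The candidate list and the sample string are assumptions for illustration, not data from the affected files.

```python
from collections import Counter


def count_candidates(content, candidates):
    """Return how often each candidate quote/escape character occurs in content."""
    counts = Counter(content)
    return {ch: counts[ch] for ch in candidates}


# Made-up sample content; real files may contain any character.
sample = 'a|b~c "quoted" \\ path^x'
collisions = count_candidates(sample, ['"', '\\', '|', '~', '^', '\x01'])

# Candidates that do not occur in this sample (no guarantee for other data).
safe = [ch for ch, n in collisions.items() if n == 0]
print(collisions)
print(safe)
```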
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

