clairezhuang opened a new pull request, #38446:
URL: https://github.com/apache/spark/pull/38446
When the value of quote or escape exists in the content of a CSV file, characters are misidentified
We found that when the configured quote or escape character itself appears inside the content of a CSV file, Spark misidentifies it during parsing.
When this content is read by an Azure Data Factory copy activity and written
to CSV, the resulting file contains:
"test\\\\" =
test"
We read the CSV as below:
df = spark.read.csv(path='test.csv'
, sep=','
, header=True
, quote='"'
, escape='\\'
, multiLine=True
, lineSep='\n'
)
resulting in the following being read from the CSV: **test\\" =** in one row and
**test** in the next row, but the value we want is **test\\\\" = test**.
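To make the exact on-disk bytes unambiguous, here is a small reconstruction of the file described above (an assumption based on the report: the two physical lines belong to a single quoted field, which is why `multiLine=True` is set in the reader; the column name `col1` is hypothetical):

```python
# Hypothetical reconstruction of the problematic CSV file from the report.
# The quoted field spans two physical lines and contains both the escape
# character ('\') and the quote character ('"').
header = "col1"
field_on_disk = '"test\\\\" =\ntest"'  # raw bytes: "test\\" =<newline>test"
raw_csv = header + "\n" + field_on_disk + "\n"

# The three physical lines of the file:
lines = raw_csv.splitlines()
# lines == ['col1', '"test\\\\" =', 'test"']
```

Note that the field's content mixes two literal backslashes directly followed by a quote character, which is exactly the sequence that confuses an escape-aware parser.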
Now when the above is being read by Spark:
1. The first \ is interpreted as escaping the second \, so the two characters
collapse into a single literal \.
2. The " now appears to be an unescaped quote character, so we're back in
the situation where Spark tries to handle this using STOP_AT_DELIMITER.
As before, the rest of the CSV after this point is being parsed incorrectly.
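The two steps above can be sketched with a minimal escape-aware quoted-field parser (an illustration only, not Spark's actual univocity-based parser):

```python
# Minimal sketch of escape-then-quote handling inside a quoted CSV field.
def parse_quoted_field(s, quote='"', escape='\\'):
    """Parse one quoted field, starting just after the opening quote.

    Returns (field_value, chars_consumed).
    """
    out = []
    i = 0
    while i < len(s):
        c = s[i]
        if c == escape and i + 1 < len(s):
            # Step 1: the escape consumes the next character literally,
            # so the pair '\\' collapses into a single literal '\'.
            out.append(s[i + 1])
            i += 2
        elif c == quote:
            # Step 2: an unescaped quote terminates the field early.
            return ''.join(out), i + 1
        else:
            out.append(c)
            i += 1
    return ''.join(out), i

# Field bytes after the opening quote, as written to the file:
value, consumed = parse_quoted_field('test\\\\" =\ntest"')
# value is 'test\' (one literal backslash) and parsing stopped at the '"',
# truncating the field instead of reading through to the real closing quote.
```

Because the escape character pairs greedily with the character after it, the doubled backslash is consumed before the quote is examined, so the quote can never be recognized as escaped; the remainder of the line is then mishandled, as described above.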
We could change the quote and escape options to avoid the scenario above, but
the content of these CSV files is very large and any character may occur in it.
If we changed the quote and escape characters, they could conflict with the
content again, and the same parsing issue would reappear later in the file.
As for designing the content to avoid certain characters: the data sources
affected by this issue are systems outside of our control, so we have no means
of controlling what content or characters will be there.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]