MaxGekk opened a new pull request #33300:
URL: https://github.com/apache/spark/pull/33300


   ### What changes were proposed in this pull request?
   In this PR, I propose to update the SQL migration guide, in particular the 
section about migrating from Spark 2.4 to 3.0. The new item informs users about 
the following issue:
   
   **What**: Spark doesn't correctly detect the encoding (charset) of CSV files 
with a BOM. Such files can be read only in multiLine mode, and only when the CSV 
option encoding matches the actual encoding of the files. For example, Spark 
cannot read UTF-16BE CSV files when encoding is set to UTF-8, which is the 
default. This is the case in the current ES ticket.
   
   **Why**: In previous Spark versions, the encoding wasn't propagated to the 
underlying library, which means the library tried to detect the file encoding 
automatically. Detection could succeed for some encodings that require a BOM at 
the beginning of files. Starting from version 3.0, users can specify the file 
encoding via the CSV option encoding, which has UTF-8 as the default value. 
Spark propagates this default to the underlying library (uniVocity), and as a 
consequence encoding auto-detection is turned off.
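
To make the idea of BOM-based detection concrete, here is a minimal Python sketch of how a reader could infer the charset from a file's leading bytes. It is not uniVocity's actual implementation: the function name `detect_bom_encoding` and the fallback to UTF-8 are assumptions made for illustration only.

```python
import codecs

# BOM signatures, checked with UTF-32 first: the UTF-32LE BOM begins with
# the same two bytes as the UTF-16LE BOM, so order matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def detect_bom_encoding(head: bytes, default: str = "UTF-8") -> str:
    """Return the encoding implied by a leading BOM, or `default` if none.

    Hypothetical helper: `head` is the first few bytes of a file.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


# A small UTF-16BE CSV payload preceded by its BOM.
data = codecs.BOM_UTF16_BE + "a,b\n1,2\n".encode("utf-16-be")
print(detect_bom_encoding(data))
```

Once a fixed encoding such as UTF-8 is forced on the parser, a check of this kind is skipped, which matches the behavior change described above.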
   
   **When**: Since Spark 3.0. In particular, the commit 
https://github.com/apache/spark/commit/2df34db586bec379e40b5cf30021f5b7a2d79271 
causes the issue.
   
   **Workaround**: Enable the encoding auto-detection mechanism in uniVocity 
by passing null as the value of the CSV option encoding. The recommended 
approach, however, is to set the encoding option explicitly.
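
As a rough illustration of the recommended approach, the following PySpark sketch sets both options explicitly. The helper name `read_csv_with_encoding` and the example path are hypothetical; `spark` is assumed to be an existing `SparkSession`.

```python
def read_csv_with_encoding(spark, path, encoding="UTF-16BE"):
    """Read a CSV file with an explicitly specified charset.

    multiLine mode is enabled because, as described above, CSV files
    with a BOM can be read only in that mode.
    """
    return (
        spark.read
        .option("multiLine", True)     # required for files with a BOM
        .option("encoding", encoding)  # must match the file's real encoding
        .csv(path)
    )


# Example usage (hypothetical path):
# df = read_csv_with_encoding(spark, "/path/to/utf16be.csv", "UTF-16BE")
```

Setting the encoding explicitly avoids relying on auto-detection, which can succeed only for encodings that are identified by a BOM.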
   
   
   ### Why are the changes needed?
   To improve the user experience with Spark SQL. This should help users in 
their migration from Spark 2.4.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   It should be checked by building the docs in GA/Jenkins.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


