Max Gekk created SPARK-36089:
--------------------------------

             Summary: Update the SQL migration guide about encoding 
auto-detection of CSV files 
                 Key: SPARK-36089
                 URL: https://issues.apache.org/jira/browse/SPARK-36089
             Project: Spark
          Issue Type: Documentation
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Max Gekk
            Assignee: Max Gekk


Need to update the SQL migration guide to inform users about behavior change.

*What*: Spark doesn't detect the encoding (charset) of CSV files with a BOM 
correctly. Such files can be read only in multiLine mode, and only when the CSV 
option encoding matches the actual encoding of the files. For example, Spark 
cannot read UTF-16BE CSV files when encoding is set to UTF-8, which is the 
default. This is the case in the current ES ticket.
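The mismatch above can be illustrated outside of Spark with plain Python (a 
minimal sketch, not Spark code): decoding UTF-16BE bytes that start with a BOM 
using the UTF-8 charset fails outright, while the matching charset reads them 
fine.

```python
# Bytes of a small UTF-16BE "CSV file", prefixed with the UTF-16BE BOM.
data = "id,name\n1,a".encode("utf-16-be")
bom_data = b"\xfe\xff" + data

# Decoding with the default UTF-8 charset fails (0xFE is not valid UTF-8):
try:
    bom_data.decode("utf-8")
except UnicodeDecodeError as e:
    print("cannot decode as UTF-8:", e.reason)

# Decoding with the matching charset works; the utf-16 codec consumes the BOM:
print(bom_data.decode("utf-16"))
```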

*Why*: In previous Spark versions, the encoding wasn't propagated to the 
underlying library, which means the library tried to detect the file encoding 
automatically. It could succeed for encodings that require a BOM to be present 
at the beginning of files. Starting from version 3.0, users can specify the 
file encoding via the CSV option encoding, which has UTF-8 as the default 
value. Spark propagates this default to the underlying library (uniVocity), 
and as a consequence encoding auto-detection is turned off.

*When*: Since Spark 3.0. In particular, the commit 
https://github.com/apache/spark/commit/2df34db586bec379e40b5cf30021f5b7a2d79271 
causes the issue.

*Workaround*: Enable the encoding auto-detection mechanism in uniVocity by 
passing null as the value of the CSV option encoding. The recommended approach, 
however, is to set the encoding option to UTF-16 explicitly.
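The BOM-based auto-detection that the null value re-enables can be sketched 
roughly as follows (a stdlib Python illustration of the general technique, 
similar in spirit to what uniVocity does; the function name and fallback are 
illustrative, not Spark or uniVocity API):

```python
import codecs

# Known BOMs, longest first so 4-byte UTF-32 BOMs win over the 2-byte
# UTF-16LE prefix they share.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_encoding(head: bytes, default: str = "utf-8") -> str:
    """Return the charset implied by a leading BOM, or `default` if none."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default

print(detect_encoding(b"\xfe\xff" + "a,b".encode("utf-16-be")))  # utf-16-be
print(detect_encoding(b"a,b"))                                   # utf-8
```

Note that a file without any BOM is indistinguishable to this scheme, which is 
why an explicit encoding option is the more robust fix.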



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
