[ 
https://issues.apache.org/jira/browse/SPARK-36089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-36089:
-----------------------------
    Affects Version/s: 3.0.4
                       3.1.3

> Update the SQL migration guide about encoding auto-detection of CSV files 
> --------------------------------------------------------------------------
>
>                 Key: SPARK-36089
>                 URL: https://issues.apache.org/jira/browse/SPARK-36089
>             Project: Spark
>          Issue Type: Documentation
>          Components: SQL
>    Affects Versions: 3.2.0, 3.1.3, 3.0.4
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>
> The SQL migration guide needs an update to inform users about a behavior change.
> *What*: Spark doesn't correctly detect the encoding (charset) of CSV files
> that start with a BOM. Such files can be read only in multiLine mode, and
> only when the CSV option encoding matches the actual encoding of the files.
> For example, Spark cannot read UTF-16BE CSV files when encoding is set to
> UTF-8, which is the default. This is the case in the current ES ticket.
> *Why*: In previous Spark versions, the encoding wasn't propagated to the
> underlying library, so the library tried to detect the file encoding
> automatically. Detection could succeed for some encodings that require a BOM
> at the beginning of files. Starting from version 3.0, users can specify the
> file encoding via the CSV option encoding, which has UTF-8 as the default
> value. Spark propagates this default to the underlying library (uniVocity),
> which, as a consequence, turns off encoding auto-detection.
> *When*: Since Spark 3.0. In particular, the commit 
> https://github.com/apache/spark/commit/2df34db586bec379e40b5cf30021f5b7a2d79271
>  causes the issue.
> *Workaround*: Enable the encoding auto-detection mechanism in uniVocity by
> passing null as the value of the CSV option encoding. The preferred approach
> is to set the encoding option to UTF-16 explicitly.
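
The difference between BOM-based auto-detection and a fixed encoding, as described under *Why* and *Workaround* above, can be sketched in plain Python. This is only an illustration using the standard library, not Spark or uniVocity code; the helper names sniff_bom and decode_csv_bytes are hypothetical, and passing encoding=None here merely mimics the idea of "no encoding set, so detect it".

```python
import codecs

# BOM signatures, with the UTF-32 ones first: the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM, so it must be checked earlier.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data, default="utf-8"):
    """Return (encoding, bom_length) guessed from a leading BOM, if any."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def decode_csv_bytes(data, encoding=None):
    """Decode raw CSV bytes.

    encoding=None  -> auto-detect from the BOM (hypothetical stand-in for
                      "encoding not propagated, library detects it").
    encoding given -> trust the caller, with no detection at all.
    """
    if encoding is None:
        detected, skip = sniff_bom(data)
        return data[skip:].decode(detected)
    return data.decode(encoding)


# A small UTF-16BE CSV payload that starts with a BOM.
sample = codecs.BOM_UTF16_BE + "id,name\n1,Zo\u00eb\n".encode("utf-16-be")

auto_text = decode_csv_bytes(sample)                # detection via BOM
explicit_text = decode_csv_bytes(sample, "utf-16")  # explicit, BOM-aware codec
```

Both auto_text and explicit_text hold the same decoded rows, whereas decode_csv_bytes(sample, "utf-8") raises UnicodeDecodeError. That failure is analogous to forcing the UTF-8 default onto a UTF-16BE file.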



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

