[
https://issues.apache.org/jira/browse/SPARK-36089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-36089:
-----------------------------
Affects Version/s: 3.0.4
3.1.3
> Update the SQL migration guide about encoding auto-detection of CSV files
> --------------------------------------------------------------------------
>
> Key: SPARK-36089
> URL: https://issues.apache.org/jira/browse/SPARK-36089
> Project: Spark
> Issue Type: Documentation
> Components: SQL
> Affects Versions: 3.2.0, 3.1.3, 3.0.4
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
>
> Need to update the SQL migration guide to inform users about the behavior change.
> *What*: Spark does not correctly detect the encoding (charset) of CSV files
> that start with a BOM. Such files can be read only in multiLine mode, and
> only when the CSV option encoding matches the actual encoding of the files.
> For example, Spark cannot read UTF-16BE CSV files when encoding is set to
> UTF-8, which is the default value. This is the case in the current ES ticket.
> *Why*: In previous Spark versions, encoding wasn't propagated to the
> underlying library, which meant the library tried to detect the file encoding
> automatically. Detection could succeed for some encodings that require a BOM
> to be present at the beginning of files. Starting with version 3.0, users can
> specify the file encoding via the CSV option encoding, which has UTF-8 as its
> default value. Spark propagates this default to the underlying library
> (uniVocity), and as a consequence encoding auto-detection is turned off.
> *When*: Since Spark 3.0. In particular, the commit
> https://github.com/apache/spark/commit/2df34db586bec379e40b5cf30021f5b7a2d79271
> causes the issue.
> *Workaround*: Enable the encoding auto-detection mechanism in uniVocity by
> passing null as the value of the CSV option encoding. The preferred approach
> is to set the encoding option to UTF-16 explicitly.