Max Gekk created SPARK-36089:
--------------------------------
Summary: Update the SQL migration guide about encoding
auto-detection of CSV files
Key: SPARK-36089
URL: https://issues.apache.org/jira/browse/SPARK-36089
Project: Spark
Issue Type: Documentation
Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk
Assignee: Max Gekk
Need to update the SQL migration guide to inform users about a behavior change.
*What*: Spark doesn't correctly detect the encoding (charset) of CSV files with
a BOM. Such files can be read only in multiLine mode, and only when the CSV
option encoding matches the actual encoding of the files. For example, Spark
cannot read UTF-16BE CSV files when encoding is set to UTF-8, which is the
default. This is the case in the current ES ticket.
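The failure mode above can be sketched without Spark: bytes written in UTF-16BE do not round-trip when decoded as UTF-8, which is what happens when the default encoding is applied to such a file (a minimal illustration, not Spark's actual CSV path):

```python
# A UTF-16BE payload decoded with the wrong (default) charset does not
# reproduce the original text, while the matching charset does.
original = "id,name\n1,Max\n"
data = original.encode("utf-16-be")

as_utf16be = data.decode("utf-16-be")            # correct charset
as_utf8 = data.decode("utf-8", errors="replace")  # Spark's default, wrong here

print(as_utf16be == original)  # True
print(as_utf8 == original)     # False: NUL-interleaved mojibake
```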
*Why*: In previous Spark versions, the encoding wasn't propagated to the
underlying library, which meant the library tried to detect the file encoding
automatically. It could succeed for encodings that require a BOM at the
beginning of the file. Starting from version 3.0, users can specify the file
encoding via the CSV option encoding, which has UTF-8 as the default value.
Spark propagates this default to the underlying library (uniVocity), and as a
consequence encoding auto-detection is turned off.
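BOM-based detection of the kind described above can be sketched as follows. This is a rough illustration of the idea, not uniVocity's actual implementation:

```python
import codecs

def detect_bom_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding from a leading BOM; fall back to a default.
    Rough sketch of BOM-based auto-detection, not the uniVocity code."""
    # Order matters: UTF-32 LE's BOM starts with UTF-16 LE's bytes,
    # so the longer BOMs must be checked first.
    for bom, enc in [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]:
        if data.startswith(bom):
            return enc
    return default  # no BOM: detection cannot help, use the default

sample = codecs.BOM_UTF16_BE + "id,name\n".encode("utf-16-be")
print(detect_bom_encoding(sample))  # utf-16-be
```

Note the fallback: files without a BOM cannot be detected this way, which is why the old behavior only "could succeed" for BOM-carrying encodings.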
*When*: Since Spark 3.0. In particular, the commit
https://github.com/apache/spark/commit/2df34db586bec379e40b5cf30021f5b7a2d79271
introduced the issue.
*Workaround*: Enable the encoding auto-detection mechanism in uniVocity by
passing null as the value of the CSV option encoding. The recommended approach
is to set the encoding option to UTF-16 explicitly.
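The explicit-encoding approach can be demonstrated with the Python stdlib; the Spark equivalent (shown only as a comment, not executed here) would pass the file's real encoding through the reader options:

```python
import csv
import io

# Spark equivalent (illustrative, assumes an existing SparkSession):
#   spark.read.option("encoding", "UTF-16").option("multiLine", True).csv(path)
# Below, the same idea with plain Python: name the file's actual
# encoding instead of relying on auto-detection.
raw = "id,name\n1,Max\n".encode("utf-16")  # writes a BOM, then UTF-16 LE

rows = list(csv.reader(io.StringIO(raw.decode("utf-16"))))
print(rows)  # [['id', 'name'], ['1', 'Max']]
```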
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]