[
https://issues.apache.org/jira/browse/SPARK-23723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiao Li updated SPARK-23723:
----------------------------
Summary: New encoding option for json datasource (was: New charset option
for json datasource)
> New encoding option for json datasource
> ---------------------------------------
>
> Key: SPARK-23723
> URL: https://issues.apache.org/jira/browse/SPARK-23723
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
> Fix For: 2.4.0
>
>
> Currently the JSON reader can read JSON files in different charsets/encodings.
> It uses the Jackson library (jackson-core) to automatically detect the charset
> of the input text/stream. The method that detects the encoding is here:
> [https://github.com/FasterXML/jackson-core/blob/master/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L111-L174]
>
> The detectEncoding method checks for a BOM
> ([https://en.wikipedia.org/wiki/Byte_order_mark]) at the beginning of the text.
> A BOM may be present in the file, but it is not mandatory. If it is absent, the
> auto-detection mechanism can select the wrong charset, and as a consequence
> the user cannot read the JSON file. *The proposed option will allow the user
> to bypass the auto-detection mechanism and set the charset explicitly.*
>
> The charset option is already exposed as a CSV option:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L87-L88].
> I propose adding the same option for JSON.
>
> Regarding the JSON writer, *the charset option will give the user the
> opportunity* to read JSON files in a charset other than UTF-8, modify the
> dataset, and *write the results back to JSON files in the original encoding.*
> At the moment this is not possible because the result can be saved in UTF-8
> only.
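
The problem described above can be illustrated outside Spark with a minimal
sketch that uses only the Python standard library. It is not Jackson's
detectEncoding and not the proposed Spark option: the names sniff_bom,
read_json_lines and write_json_lines are hypothetical helpers invented for
this illustration. The sketch shows that a BOM check finds nothing in
UTF-16LE data written without a BOM, and that naming the encoding explicitly
allows the data to be read, modified, and written back in the original
encoding.

```python
import codecs
import json

# BOM signatures, UTF-32 first: the UTF-32LE BOM begins with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none.

    Hypothetical helper: a plain BOM check, not Jackson's full heuristic.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def read_json_lines(data: bytes, encoding: str):
    """Decode with an explicitly chosen encoding, bypassing any detection."""
    text = data.decode(encoding)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def write_json_lines(records, encoding: str) -> bytes:
    """Serialize one JSON object per line and encode with the given encoding."""
    text = "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
    return text.encode(encoding)


# Sample data: UTF-16LE JSON lines without a BOM ("utf-16-le" never adds one).
original = write_json_lines([{"city": "Zürich"}, {"city": "Kraków"}], "utf-16-le")

# The BOM check has nothing to go on for this input.
detected = sniff_bom(original)

# With the encoding given explicitly, read, modify, and write back unchanged
# in encoding.
records = read_json_lines(original, "utf-16-le")
for record in records:
    record["seen"] = True
round_tripped = write_json_lines(records, "utf-16-le")
```

Passing the encoding as an argument to both the read and the write helper
mirrors the idea of a single option shared by the reader and the writer.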