GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/20937
[SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files
## What changes were proposed in this pull request?
I propose a new option for the JSON datasource that specifies the encoding
(charset) of input and output files. Here is an example of using the option:
```
spark.read.schema(schema)
  .option("multiline", "true")
  .option("encoding", "UTF-16LE")
  .json(fileName)
```
If the option is not specified, the charset auto-detection mechanism is used by
default.
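The description does not spell out how auto-detection works. One common approach is to inspect a leading byte order mark, sketched below with JDK classes only. The `BomSniffer` object and its `detect` method are hypothetical names used for illustration; this is not the detection code used by Spark or by its JSON parser, and it returns `None` for input without a BOM.

```scala
import java.nio.charset.Charset

object BomSniffer {
  // Known BOMs, longest first, so that UTF-32LE (FF FE 00 00)
  // is matched before UTF-16LE (FF FE).
  private val boms: Seq[(Array[Byte], String)] = Seq(
    Array(0x00, 0x00, 0xFE, 0xFF).map(_.toByte) -> "UTF-32BE",
    Array(0xFF, 0xFE, 0x00, 0x00).map(_.toByte) -> "UTF-32LE",
    Array(0xEF, 0xBB, 0xBF).map(_.toByte) -> "UTF-8",
    Array(0xFE, 0xFF).map(_.toByte) -> "UTF-16BE",
    Array(0xFF, 0xFE).map(_.toByte) -> "UTF-16LE"
  )

  // Returns the charset implied by the BOM, or None if no BOM is present.
  def detect(bytes: Array[Byte]): Option[Charset] =
    boms.collectFirst {
      case (bom, name) if bytes.startsWith(bom) => Charset.forName(name)
    }
}
```

Input without a BOM needs a different heuristic, such as looking at the pattern of zero bytes in the first few bytes, which this sketch leaves out.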
The option can also be used when saving datasets as JSON. Currently Spark can
save datasets to JSON files in the UTF-8 charset only. The changes allow
saving data in any supported charset. Here is the approximate list of
charsets supported by Oracle Java SE:
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html .
A user can specify the charset of the output JSON files via the charset option, like
`.option("charset", "UTF-16")`. By default the output charset is still UTF-8 to
keep backward compatibility.
The solution has the following restrictions in per-line mode (`multiline =
false`):
- If the charset is different from UTF-8, the lineSep option must be specified.
The option is required because Hadoop LineReader cannot detect the line separator
correctly. Here is the ticket for solving the issue:
https://issues.apache.org/jira/browse/SPARK-23725
- JSON files that start with a
[BOM](https://en.wikipedia.org/wiki/Byte_order_mark) cannot be read properly. A
possible solution is a flexible format for `lineSep` that allows specifying the
line separator as a sequence of bytes independently of the encoding. A pull request
for that will be prepared soon.
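To illustrate why both restrictions come down to raw bytes, here is a small JDK-only sketch (no Spark involved) that prints the byte sequence of a line feed in several charsets, and the BOM that Java's `UTF-16` encoder prepends. The `LineSepBytes` object and its `hex` helper are hypothetical names introduced for this example.

```scala
import java.nio.charset.Charset

object LineSepBytes {
  // Hex dump of a string encoded in the named charset, e.g. "0A 00".
  def hex(s: String, charsetName: String): String =
    s.getBytes(Charset.forName(charsetName))
      .map(b => f"${b & 0xFF}%02X")
      .mkString(" ")

  def main(args: Array[String]): Unit = {
    // The same line feed occupies a different byte sequence in each charset,
    // so a reader that scans for the single byte 0x0A splits UTF-16/UTF-32
    // input in the middle of a character.
    Seq("UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32BE").foreach { cs =>
      val dump = hex("\n", cs)
      println(s"$cs: $dump")
    }
    // Java's "UTF-16" encoder prepends a big-endian BOM (FE FF).
    val withBom = hex("{}", "UTF-16")
    println(s"UTF-16 with BOM: $withBom")
  }
}
```

A line feed is `0A` in UTF-8 but `0A 00` in UTF-16LE and `00 00 00 0A` in UTF-32BE, which is why a line separator given as a sequence of bytes would be independent of the encoding.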
## How was this patch tested?
I added the following tests:
- reading a JSON file in the UTF-16 charset with a BOM
- reading a JSON file using charset auto-detection (UTF-32BE with BOM)
- reading a JSON file using a user-specified charset (UTF-16LE)
- saving in UTF-32BE and reading the result with the standard library (not Spark)
- checking that the default charset is UTF-8
- handling a wrong (unsupported) charset
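As a rough illustration of the unsupported-charset case, the sketch below validates a user-supplied charset name with the JDK's `Charset.forName`, which accepts names in any letter case. The `CharsetCheck` object and its `resolve` method are hypothetical and are not taken from the patch.

```scala
import java.nio.charset.{Charset, UnsupportedCharsetException}
import scala.util.{Failure, Success, Try}

object CharsetCheck {
  // Hypothetical helper: resolves a charset name, or returns an error message
  // that tells the user which name was rejected.
  def resolve(name: String): Either[String, Charset] =
    Try(Charset.forName(name)) match {
      case Success(cs) => Right(cs)
      case Failure(_: UnsupportedCharsetException) =>
        Left(s"Unsupported charset: $name")
      case Failure(e) =>
        Left(s"Invalid charset name: $name (${e.getMessage})")
    }
}
```

For example, `CharsetCheck.resolve("utf-16le")` yields the same charset as `"UTF-16LE"`, while a made-up name such as `"UTF-128"` yields a `Left` with an error message.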
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 json-encoding-line-sep
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20937.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20937
----
commit b2e92b4706c5ed3b141805933f29beb87e1b7371
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-11T20:06:53Z
Test for reading json in UTF-16 with BOM
commit cb2f27ba73cb5838e2910c31ca204100bb4eebca
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-11T20:48:35Z
Use user's charset or autodetect it if the charset is not specified
commit 0d45fd382bb90ebd7161d57a3da23820b4497f67
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-13T09:48:08Z
Added a type and a comment for charset
commit 1fb9b321a4fac0f41cfb9dd5f85b61feb6796227
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-13T10:00:27Z
Replacing the monadic chaining by matching because it is more readable
commit c3b04ee68338ad4f93a5361a41db28b37f020907
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-13T10:44:19Z
Keeping the old method for backward compatibility
commit 93d38794dd261ee1bbe2497470ee43de1186ef3c
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-13T10:54:52Z
testFile is moved into the test to make more local because it is used only
in the test
commit 15798a1ce61df29e9a32f960e755495e3d63f4e3
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-13T11:15:25Z
Adding the charset as third parameter to the text method
commit cc05ce9af7c9f1d14bd10c1f46a60ce043c13fe1
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-13T11:29:57Z
Removing whitespaces at the end of the line
commit 74f2026e62389902ab7a4c418aa96a492fa14f6f
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-13T12:29:28Z
Fix the comment in javadoc style
commit 4856b8e0b287b3ba3331865298f0603dde18459c
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-13T12:32:48Z
Simplifying of the UTF-16 test
commit 084f41fb6edd7c86aeb8643973119cb4b38a34fa
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-15T17:33:25Z
A hint to the exception how to set the charset explicitly
commit 31cd793a86e6a0e48e0150ffb8c36da2872c65ca
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-15T18:00:55Z
Fix for scala style checks
commit 6eacd186a954a3f724ee607826b17f432ead77e1
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-15T18:44:04Z
Run tests again
commit 3b4a509d0260cfab720a5471ccd937de55c56093
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-15T19:06:06Z
Improving of the exception message
commit cd1124ef7e6329f4dcd6926064271cd24b5a150d
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-15T19:41:35Z
Appended the original message to the exception
commit ebf53904151582eef6d95780ca30b773404ae141
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-17T20:36:28Z
Multi-line reading of json file in utf-32
commit c5b6a35d08dabedaca2dab6eedfd2a13bdb62e5a
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-17T21:43:23Z
Autodetect charset of jsons in the multiline mode
commit ef5e6c6ec607239864375053a6e921acd3deae96
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-17T21:54:21Z
Test for reading a json in UTF-16LE in the multiline mode by using user's
charset
commit f9b6ad141c7a1b9668fe0a2e4bdf6bdbdc54b98e
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-18T03:08:37Z
Fix test: rename the test file - utf32be -> utf32BE
commit 3b7714c8bbe31475b4797e4303ded6c59634921a
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-18T21:03:47Z
Fix code style
commit edb9167903c9e7667f6a536f139561ed3aadb6e6
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-18T21:09:28Z
Appending the create verb to the method for readability
commit 5ba2881c252f40f7c736232cd01c1421ba4b811c
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-18T21:20:04Z
Making the createParser as a separate private method
commit 1509e103f8b86393b5442d516ee283a16b7fa7e7
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-18T21:27:01Z
Fix code style
commit e3184b35e504ce46b82ee18babd3395b7d1fc34d
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-19T21:17:43Z
Checks the charset option is supported
commit 87d259c7d190716a89016c85b7450d471b3481bf
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-19T21:19:02Z
Support charset as a parameter of the json method
commit 76c1d08af25f8f4717314d6ba1409476d63b2ffd
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-19T22:16:31Z
Test for charset different from utf-8
commit 88395b5f9973395d4ad0b90cc094726dee4f502a
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-20T12:39:34Z
Description of the charset option of the json method
commit f2f8ae72e024f39efaed8f93da11a7ebb0ef6870
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-21T12:48:41Z
Minor changes in comments: added . at the end of a sentence
commit b451a03f900aa76365e44fe10419cd8345feae09
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-21T19:47:58Z
Added a test for wrong charset name
commit c13c15946b077800d6d68fb77f0f4692cc9f3a17
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-02-21T20:05:03Z
Testing that charset in any case is acceptable
----