GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/20937

    [SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files

    ## What changes were proposed in this pull request?
    
    I propose a new option for the JSON datasource that specifies the encoding 
(charset) of input and output files. Here is an example of using the option:
    
    ```
    spark.read.schema(schema)
      .option("multiline", "true")
      .option("encoding", "UTF-16LE")
      .json(fileName)
    ```
    
    If the option is not specified, the charset auto-detection mechanism is used 
by default.
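
    To make the idea of auto-detection concrete, here is a minimal sketch of 
BOM-based sniffing. It is an illustration only and is not the detection code 
used by this patch; the names `BomSniffer` and `detect` are hypothetical, and 
input without a BOM is simply reported as unknown.

    ```scala
    object BomSniffer {
      // (BOM bytes, charset name). Longer BOMs come first so that
      // UTF-32LE (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
      private val boms: Seq[(Array[Byte], String)] = Seq(
        Array(0x00, 0x00, 0xFE, 0xFF).map(_.toByte) -> "UTF-32BE",
        Array(0xFF, 0xFE, 0x00, 0x00).map(_.toByte) -> "UTF-32LE",
        Array(0xEF, 0xBB, 0xBF).map(_.toByte) -> "UTF-8",
        Array(0xFE, 0xFF).map(_.toByte) -> "UTF-16BE",
        Array(0xFF, 0xFE).map(_.toByte) -> "UTF-16LE"
      )

      /** Returns the charset name implied by a leading BOM, if there is one. */
      def detect(head: Array[Byte]): Option[String] =
        boms.collectFirst {
          case (bom, name)
              if java.util.Arrays.equals(head.take(bom.length), bom) => name
        }
    }
    ```

    For example, `BomSniffer.detect` on the first bytes of a UTF-32BE file 
with a BOM yields `Some("UTF-32BE")`, while plain UTF-8 text without a BOM 
yields `None`.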
    
    The option can also be used when saving datasets as JSON. Currently Spark 
can save datasets to JSON files in the UTF-8 charset only. These changes allow 
saving data in any supported charset. Here is an approximate list of charsets 
supported by Oracle Java SE: 
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . 
A user can specify the charset of the output JSON files via the charset option, 
like `.option("charset", "UTF-16")`. By default the output charset is still 
UTF-8, to keep backward compatibility.
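
    Whether a given charset name is usable depends on the running JVM. As a 
rough sketch (not the validation code in this patch), a name can be checked 
up front with `java.nio.charset.Charset.isSupported`; the wrapper 
`CharsetCheck.isSupported` below is a hypothetical helper that also treats 
syntactically illegal names as unsupported.

    ```scala
    import java.nio.charset.Charset
    import scala.util.Try

    object CharsetCheck {
      /**
       * True if the running JVM can resolve `name`.
       * Charset.isSupported throws on illegal names, so that case maps to false.
       */
      def isSupported(name: String): Boolean =
        Try(Charset.isSupported(name)).getOrElse(false)
    }
    ```

    The JVM resolves charset names case-insensitively, so `"utf-16le"` and 
`"UTF-16LE"` are treated the same by this check.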
    
    The solution has the following restrictions in per-line mode 
(`multiline = false`):
    
    - If the charset is different from UTF-8, the lineSep option must be 
specified. The option is required because Hadoop LineReader cannot detect the 
line separator correctly. Here is the ticket for solving the issue: 
https://issues.apache.org/jira/browse/SPARK-23725
    
    - JSON files that start with a 
[BOM](https://en.wikipedia.org/wiki/Byte_order_mark) cannot be read properly. A 
possible solution is a flexible format for `lineSep` that allows specifying the 
line separator as a sequence of bytes, independently of the encoding. A pull 
request for that will be prepared soon.
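
    The byte-level reason behind both restrictions can be seen with a small 
JDK-only sketch. It assumes a reader that searches for the single byte `0x0A` 
to split lines; that is a simplification for illustration and not a 
description of Hadoop LineReader internals. The helper `LineSepBytes.hex` is 
hypothetical.

    ```scala
    object LineSepBytes {
      /** Space-separated hex dump of `text` encoded in `charset`. */
      def hex(text: String, charset: String): String =
        text.getBytes(charset).map(b => f"${b & 0xff}%02x").mkString(" ")
    }

    // The newline character is a different byte sequence in each encoding:
    //   LineSepBytes.hex("\n", "UTF-8")    is "0a"
    //   LineSepBytes.hex("\n", "UTF-16LE") is "0a 00"
    //   LineSepBytes.hex("\n", "UTF-32BE") is "00 00 00 0a"
    // A non-newline character can also contain the byte 0x0A:
    //   LineSepBytes.hex("\u010a", "UTF-16LE") is "0a 01"
    // Java's "UTF-16" encoder prepends a big-endian BOM:
    //   LineSepBytes.hex("\n", "UTF-16") is "fe ff 00 0a"
    ```

    Splitting UTF-16LE data on the byte `0x0A` alone would leave a stray 
`0x00` at the start of the next record and could split inside other 
characters, which is why a separator expressed in the file's own encoding (or 
as raw bytes) is needed.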
    
    ## How was this patch tested?
    
    I added the following tests:
    - reading a JSON file in the UTF-16 charset with a BOM
    - reading a JSON file using charset auto-detection (UTF-32BE with a BOM)
    - reading a JSON file using a user-specified charset (UTF-16LE)
    - saving in UTF-32BE and reading the result with the standard library (not Spark)
    - checking that the default charset is UTF-8
    - handling a wrong (unsupported) charset
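
    The idea behind the UTF-32BE test, reading the written bytes back without 
Spark, can be sketched with `java.nio` alone. This is not the test code from 
the patch: here the write step is also plain JDK code standing in for Spark's 
writer, and `RoundTrip.writeAndRead` is a hypothetical helper.

    ```scala
    import java.nio.charset.Charset
    import java.nio.file.Files

    object RoundTrip {
      /** Writes `json` to a temp file in `charsetName`, then decodes the raw bytes. */
      def writeAndRead(json: String, charsetName: String): String = {
        val charset = Charset.forName(charsetName)
        val path = Files.createTempFile("json-encoding", ".json")
        try {
          Files.write(path, json.getBytes(charset))
          // Decode independently of any JSON library to verify the on-disk charset.
          new String(Files.readAllBytes(path), charset)
        } finally {
          Files.deleteIfExists(path)
        }
      }
    }
    ```

    Decoding the raw bytes with the expected charset and comparing against 
the original text confirms the on-disk encoding independently of the reader 
under test.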


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 json-encoding-line-sep

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20937.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20937
    
----
commit b2e92b4706c5ed3b141805933f29beb87e1b7371
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-11T20:06:53Z

    Test for reading json in UTF-16 with BOM

commit cb2f27ba73cb5838e2910c31ca204100bb4eebca
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-11T20:48:35Z

    Use user's charset or autodetect it if the charset is not specified

commit 0d45fd382bb90ebd7161d57a3da23820b4497f67
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-13T09:48:08Z

    Added a type and a comment for charset

commit 1fb9b321a4fac0f41cfb9dd5f85b61feb6796227
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-13T10:00:27Z

    Replacing the monadic chaining by matching because it is more readable

commit c3b04ee68338ad4f93a5361a41db28b37f020907
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-13T10:44:19Z

    Keeping the old method for backward compatibility

commit 93d38794dd261ee1bbe2497470ee43de1186ef3c
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-13T10:54:52Z

    testFile is moved into the test to make more local because it is used only 
in the test

commit 15798a1ce61df29e9a32f960e755495e3d63f4e3
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-13T11:15:25Z

    Adding the charset as third parameter to the text method

commit cc05ce9af7c9f1d14bd10c1f46a60ce043c13fe1
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-13T11:29:57Z

    Removing whitespaces at the end of the line

commit 74f2026e62389902ab7a4c418aa96a492fa14f6f
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-13T12:29:28Z

    Fix the comment in javadoc style

commit 4856b8e0b287b3ba3331865298f0603dde18459c
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-13T12:32:48Z

    Simplifying of the UTF-16 test

commit 084f41fb6edd7c86aeb8643973119cb4b38a34fa
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-15T17:33:25Z

    A hint to the exception how to set the charset explicitly

commit 31cd793a86e6a0e48e0150ffb8c36da2872c65ca
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-15T18:00:55Z

    Fix for scala style checks

commit 6eacd186a954a3f724ee607826b17f432ead77e1
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-15T18:44:04Z

    Run tests again

commit 3b4a509d0260cfab720a5471ccd937de55c56093
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-15T19:06:06Z

    Improving of the exception message

commit cd1124ef7e6329f4dcd6926064271cd24b5a150d
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-15T19:41:35Z

    Appended the original message to the exception

commit ebf53904151582eef6d95780ca30b773404ae141
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-17T20:36:28Z

    Multi-line reading of json file in utf-32

commit c5b6a35d08dabedaca2dab6eedfd2a13bdb62e5a
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-17T21:43:23Z

    Autodetect charset of jsons in the multiline mode

commit ef5e6c6ec607239864375053a6e921acd3deae96
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-17T21:54:21Z

    Test for reading a json in UTF-16LE in the multiline mode by using user's 
charset

commit f9b6ad141c7a1b9668fe0a2e4bdf6bdbdc54b98e
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-18T03:08:37Z

    Fix test: rename the test file - utf32be -> utf32BE

commit 3b7714c8bbe31475b4797e4303ded6c59634921a
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-18T21:03:47Z

    Fix code style

commit edb9167903c9e7667f6a536f139561ed3aadb6e6
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-18T21:09:28Z

    Appending the create verb to the method for readability

commit 5ba2881c252f40f7c736232cd01c1421ba4b811c
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-18T21:20:04Z

    Making the createParser as a separate private method

commit 1509e103f8b86393b5442d516ee283a16b7fa7e7
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-18T21:27:01Z

    Fix code style

commit e3184b35e504ce46b82ee18babd3395b7d1fc34d
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-19T21:17:43Z

    Checks the charset option is supported

commit 87d259c7d190716a89016c85b7450d471b3481bf
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-19T21:19:02Z

    Support charset as a parameter of the json method

commit 76c1d08af25f8f4717314d6ba1409476d63b2ffd
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-19T22:16:31Z

    Test for charset different from utf-8

commit 88395b5f9973395d4ad0b90cc094726dee4f502a
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-20T12:39:34Z

    Description of the charset option of the json method

commit f2f8ae72e024f39efaed8f93da11a7ebb0ef6870
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-21T12:48:41Z

    Minor changes in comments: added . at the end of a sentence

commit b451a03f900aa76365e44fe10419cd8345feae09
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-21T19:47:58Z

    Added a test for wrong charset name

commit c13c15946b077800d6d68fb77f0f4692cc9f3a17
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-02-21T20:05:03Z

    Testing that charset in any case is acceptable

----

