[
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364866#comment-16364866
]
Bruce Robbins commented on SPARK-23410:
---------------------------------------
[~maxgekk]
My simple test input of
[{"field1": 10, "field2": "hello"},{"field1": 12, "field2": "byte"}]
is encoded like this (according to emacs hexl-mode):
{noformat}
00000000: feff 005b 007b 0022 0066 0069 0065 006c ...[.{.".f.i.e.l
00000010: 0064 0031 0022 003a 0020 0031 0030 002c .d.1.".:. .1.0.,
00000020: 0020 0022 0066 0069 0065 006c 0064 0032 . .".f.i.e.l.d.2
00000030: 0022 003a 0020 0022 0068 0065 006c 006c .".:. .".h.e.l.l
00000040: 006f 0022 007d 002c 007b 0022 0066 0069 .o.".}.,.{.".f.i
00000050: 0065 006c 0064 0031 0022 003a 0020 0031 .e.l.d.1.".:. .1
00000060: 0032 002c 0020 0022 0066 0069 0065 006c .2.,. .".f.i.e.l
00000070: 0064 0032 0022 003a 0020 0022 0062 0079 .d.2.".:. .".b.y
00000080: 0074 0065 0022 007d 005d 000a .t.e.".}.]..
{noformat}
I just used iconv to convert the file from UTF-8 to UTF-16; the leading feff bytes in the dump are the UTF-16 big-endian byte order mark.
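As a minimal sketch (in Python rather than via iconv), the bytes above can be reproduced by encoding the same JSON line as UTF-16 big-endian and prepending the BOM, which is what iconv's "UTF-16" target typically emits:

```python
# Recreate the test input from the comment as UTF-16BE with a BOM.
# This mirrors the hex dump: 0xFEFF, then each character as a 2-byte
# big-endian code unit (e.g. '[' becomes 0x00 0x5B).
text = '[{"field1": 10, "field2": "hello"},{"field1": 12, "field2": "byte"}]\n'
encoded = b"\xfe\xff" + text.encode("utf-16-be")

print(encoded[:4].hex())  # feff005b: BOM followed by '['
print(len(encoded))       # 140 bytes, matching the dump's total size
```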
> Unable to read jsons in charset different from UTF-8
> ----------------------------------------------------
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.3.0
> Reporter: Maxim Gekk
> Priority: Major
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This
> behavior breaks backward compatibility with Spark 2.2.1 and earlier
> versions, which could read JSON files in UTF-16, UTF-32, and other
> encodings thanks to the Jackson library's charset auto-detection. We
> need to give users back the ability to read JSON files in a specified
> charset and/or to detect the charset automatically, as before.
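For illustration, BOM-based charset detection of the kind the quoted description refers to can be sketched as below. This is a hedged sketch similar in spirit to Jackson's auto-detection, not its actual algorithm (Jackson also inspects null-byte patterns for BOM-less UTF-16/UTF-32 input):

```python
def detect_charset(data: bytes) -> str:
    """Guess the encoding of a JSON byte stream from its BOM, if any.

    The 4-byte UTF-32 BOMs must be checked before the 2-byte UTF-16
    BOMs, because the UTF-32LE BOM (ff fe 00 00) starts with the
    UTF-16LE BOM (ff fe).
    """
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32BE"
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "UTF-32LE"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    return "UTF-8"  # no BOM: fall back to UTF-8
```

Applied to the dump in the comment above, the leading fe ff bytes would be detected as UTF-16BE.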
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]