[
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364866#comment-16364866
]
Bruce Robbins commented on SPARK-23410:
---------------------------------------
[~maxgekk]
My simple test input of
[{"field1": 10, "field2": "hello"},{"field1": 12, "field2": "byte"}]
is encoded like this (according to emacs hexl-mode):
{noformat}
00000000: feff 005b 007b 0022 0066 0069 0065 006c ...[.{.".f.i.e.l
00000010: 0064 0031 0022 003a 0020 0031 0030 002c .d.1.".:. .1.0.,
00000020: 0020 0022 0066 0069 0065 006c 0064 0032 . .".f.i.e.l.d.2
00000030: 0022 003a 0020 0022 0068 0065 006c 006c .".:. .".h.e.l.l
00000040: 006f 0022 007d 002c 007b 0022 0066 0069 .o.".}.,.{.".f.i
00000050: 0065 006c 0064 0031 0022 003a 0020 0031 .e.l.d.1.".:. .1
00000060: 0032 002c 0020 0022 0066 0069 0065 006c .2.,. .".f.i.e.l
00000070: 0064 0032 0022 003a 0020 0022 0062 0079 .d.2.".:. .".b.y
00000080: 0074 0065 0022 007d 005d 000a .t.e.".}.]..
{noformat}
I just used iconv to convert the file from UTF-8 to UTF-16; the leading feff bytes in the dump are the UTF-16 big-endian byte order mark.
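As a minimal sketch (in Python rather than via iconv), the bytes above can be reproduced by encoding the same JSON line as UTF-16 big-endian and prepending the BOM, which is what iconv's "UTF-16" target typically emits:

```python
# Recreate the test input from the comment as UTF-16BE with a BOM.
# This mirrors the hex dump: 0xFEFF, then each character as a 2-byte
# big-endian code unit (e.g. '[' becomes 0x00 0x5B).
text = '[{"field1": 10, "field2": "hello"},{"field1": 12, "field2": "byte"}]\n'
encoded = b"\xfe\xff" + text.encode("utf-16-be")

print(encoded[:4].hex())  # feff005b: BOM followed by '['
print(len(encoded))       # 140 bytes, matching the dump's total size
```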
> Unable to read jsons in charset different from UTF-8
> ----------------------------------------------------
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.3.0
> Reporter: Maxim Gekk
> Priority: Major
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This
> behavior breaks backward compatibility with Spark 2.2.1 and earlier
> versions, which could read JSON files in UTF-16, UTF-32, and other
> encodings thanks to the Jackson library's charset auto-detection. We
> need to give users back the ability to read JSON files in a specified
> charset and/or to detect the charset automatically, as before.
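For illustration, BOM-based charset detection of the kind the quoted description refers to can be sketched as below. This is a hedged sketch similar in spirit to Jackson's auto-detection, not its actual algorithm (Jackson also inspects null-byte patterns for BOM-less UTF-16/UTF-32 input):

```python
def detect_charset(data: bytes) -> str:
    """Guess the encoding of a JSON byte stream from its BOM, if any.

    The 4-byte UTF-32 BOMs must be checked before the 2-byte UTF-16
    BOMs, because the UTF-32LE BOM (ff fe 00 00) starts with the
    UTF-16LE BOM (ff fe).
    """
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32BE"
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "UTF-32LE"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    return "UTF-8"  # no BOM: fall back to UTF-8
```

Applied to the dump in the comment above, the leading fe ff bytes would be detected as UTF-16BE.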
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]