Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/20849
  
    @cloud-fan It is a regular file in UTF-16 with BOM = `0xFF 0xFE`, which 
indicates little-endian byte order. When we slice the file by lines, the 
first line is still UTF-16 with a BOM, while the remaining lines become UTF-16LE. To read 
all the lines with the same settings for Jackson, I used the charset auto-detection 
mechanism of the Jackson library. [To do so I didn't specify any 
charset](https://github.com/MaxGekk/spark-1/blob/54fd42b64e0715540010c4d59b8b4f7a4a1b0876/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2074-L2076)
 for the input stream, but after removing the hexadecimal representation of `lineSep` 
I have to set a charset for the lineSep (`\r\n` or `\u000d\u000a`), otherwise it 
would not be possible to convert it to the byte array needed by Hadoop's 
LineReader. 
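
    Just to illustrate the point (a minimal sketch with the plain JDK, not the code from this PR): the same `lineSep` string maps to different byte sequences depending on the charset, and the JDK's `UTF-16` encoder even prepends a big-endian BOM:

```scala
import java.nio.charset.StandardCharsets

object LineSepBytes {
  private def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  def main(args: Array[String]): Unit = {
    val lineSep = "\r\n"  // i.e. "\u000d\u000a"

    println(hex(lineSep.getBytes(StandardCharsets.UTF_8)))     // 0d 0a
    println(hex(lineSep.getBytes(StandardCharsets.UTF_16LE)))  // 0d 00 0a 00
    println(hex(lineSep.getBytes(StandardCharsets.UTF_16BE)))  // 00 0d 00 0a
    // The JDK's UTF-16 encoder writes a big-endian BOM in front of the data,
    // so these bytes never match a delimiter inside a UTF-16LE file:
    println(hex(lineSep.getBytes(StandardCharsets.UTF_16)))    // fe ff 00 0d 00 0a
  }
}
```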
    
    This way, if I set `UTF-16`, I can read only the first line, but 
if I set `UTF-16LE`, the first line cannot be read because it contains a BOM (a 
`UTF-16LE` string must not contain any BOMs).
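
    The asymmetry can be reproduced with plain JDK decoding, independent of Jackson (again just a sketch): the `UTF-16` decoder honours and strips a BOM but falls back to big-endian when the BOM is absent, while the `UTF-16LE` decoder keeps a leading BOM as the character U+FEFF:

```scala
import java.nio.charset.StandardCharsets

object BomPerLine {
  def main(args: Array[String]): Unit = {
    val bom  = Array(0xFF, 0xFE).map(_.toByte)                    // UTF-16LE BOM
    val json = """{"a": 1}""".getBytes(StandardCharsets.UTF_16LE)

    val firstLine = bom ++ json   // the slice taken from the start of the file
    val otherLine = json          // any later line: UTF-16LE without a BOM

    // charset = UTF-16: the BOM is detected and stripped, so the first line
    // decodes fine, but BOM-less lines fall back to big-endian and turn into garbage
    println(new String(firstLine, StandardCharsets.UTF_16))   // {"a": 1}
    println(new String(otherLine, StandardCharsets.UTF_16))   // unreadable

    // charset = UTF-16LE: later lines decode fine, but the first line keeps a
    // leading U+FEFF character, which is not allowed before a JSON token
    println(new String(otherLine, StandardCharsets.UTF_16LE))                  // {"a": 1}
    println(new String(firstLine, StandardCharsets.UTF_16LE).charAt(0).toInt)  // 65279
  }
}
```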
    
    So, the problem is that the lineSep option doesn't define the actual delimiter 
required to split the input text into lines. It just defines a string which requires 
a charset to be converted into the real delimiter (an array of bytes). The hex format 
proposed in my [first PR](https://github.com/MaxGekk/spark-1/pull/1) solves the 
problem.
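
    Roughly, the idea is that a hex-encoded lineSep can be parsed straight into bytes, so no charset is involved in the conversion. A minimal sketch below; the `x0d 00 0a 00` syntax is only a hypothetical illustration, the concrete format is defined in the linked PR:

```scala
object HexLineSep {
  // Sketch only: parse a hex-encoded line separator directly into bytes.
  // The "x0d 00 0a 00" form is made up here for illustration.
  def toDelimiterBytes(lineSep: String): Array[Byte] =
    lineSep.trim.stripPrefix("x").split("\\s+").map(s => Integer.parseInt(s, 16).toByte)

  def main(args: Array[String]): Unit = {
    val delim = toDelimiterBytes("x0d 00 0a 00")
    // 0d 00 0a 00 -- exactly the bytes that delimit lines in the UTF-16LE file,
    // obtained without specifying any charset
    println(delim.map(b => f"${b & 0xff}%02x").mkString(" "))
  }
}
```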

