Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/20849
@HyukjinKwon I did an experiment in
https://github.com/MaxGekk/spark-1/pull/2 and modified [the
test](https://github.com/MaxGekk/spark-1/blob/f94d846b39ade89da24ef3e85f9721fb34e48154/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2072-L2083).
If `UTF-16LE` is set explicitly:
```
val jsonDF = spark.read.schema(schema)
.option("lineSep", "x0d 00 0a 00")
.option("encoding", "UTF-16LE")
.json(testFile(fileName))
```
only the second line is returned correctly:
```
+---------+--------+
|firstName|lastName|
+---------+--------+
| null| null|
| Doug| Rood|
+---------+--------+
```
In the case of `UTF-16`, only the first row is returned correctly from the CSV file:
```
+---------+--------+
|firstName|lastName|
+---------+--------+
| Chris| Baird|
| null| null|
+---------+--------+
```
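This matches the JVM's charset behavior (a standalone sketch, not Spark code, assuming a tiny hand-made UTF-16LE byte sequence): the `UTF-16` decoder consumes a leading BOM and picks the byte order from it, while `UTF-16LE` keeps the BOM as a stray `U+FEFF` in the decoded text. Since only the first chunk of the file carries the BOM, decoding each line independently with `UTF-16` can work for line 1 and fail for the rest:
```
// A UTF-16LE encoding of "a" preceded by the little-endian BOM 0xFF 0xFE
val withBom = Array(0xFF, 0xFE, 'a'.toInt, 0x00).map(_.toByte)

// "UTF-16" detects the byte order from the BOM and strips it
val auto = new String(withBom, "UTF-16")   // "a"
// "UTF-16LE" assumes the order and leaves the BOM in the string
val le = new String(withBom, "UTF-16LE")   // "\uFEFFa"
```
Without a BOM, the `UTF-16` decoder defaults to big-endian, so a BOM-less little-endian line decodes to garbage, which would explain the nulls for the second row.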
And you are right: in the case where the encoding is `UTF-16`, the BOM is added to the
delimiter:
```
val jsonDF = spark.read.schema(schema)
.option("lineSep", "\r\n")
.option("encoding", "UTF-16")
.json(testFile(fileName))
```
The `lineSeparator` parameter of `HadoopFileLinesReader` is `0xFE 0xFF 0x00
0x0D 0x00 0x0A` - BOM + `\r\n` in UTF-16BE (in the CSV file it is BOM + UTF-16LE). Even if we cut
the BOM from `lineSep`, it would still not be correct.
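Those bytes can be reproduced outside Spark (a standalone sketch, the `hex` helper is just for illustration): the JVM's `UTF-16` charset encoder prepends a big-endian BOM, unlike the endianness-specific charsets:
```
// How "\r\n" is encoded by the three UTF-16 charset names on the JVM
def hex(s: String, cs: String): String =
  s.getBytes(cs).map(b => f"0x${b & 0xFF}%02X").mkString(" ")

println(hex("\r\n", "UTF-16"))   // 0xFE 0xFF 0x00 0x0D 0x00 0x0A  (BOM + big-endian)
println(hex("\r\n", "UTF-16BE")) // 0x00 0x0D 0x00 0x0A
println(hex("\r\n", "UTF-16LE")) // 0x0D 0x00 0x0A 0x00
```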
So, there are actually 2 (or 3) problems.
Just in case, here is the same test with `lineSep` `\r\n` and explicit `UTF-16LE`:
```
val jsonDF = spark.read.schema(schema)
.option("lineSep", "\r\n")
.option("encoding", "UTF-16LE")
.json(testFile(fileName))
```
```
+---------+--------+
|firstName|lastName|
+---------+--------+
| null| null|
| Doug| Rood|
+---------+--------+
```