Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20849#discussion_r175282421
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2063,4 +2063,178 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
)
}
}
+
+  def testFile(fileName: String): String = {
+    Thread.currentThread().getContextClassLoader.getResource(fileName).toString
+  }
+
+  test("json in UTF-16 with BOM") {
+    val fileName = "json-tests/utf16WithBOM.json"
+    val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
+    val jsonDF = spark.read.schema(schema)
+      // The mode filters out null rows produced because the newline delimiter
+      // for UTF-8 is used by default.
--- End diff ---
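To make the in-diff comment about the delimiter concrete: a tiny sketch (my own illustration, not code from this PR) of how the newline byte sequence differs per encoding, which is why splitting records on the single UTF-8 byte 0x0A mangles UTF-16 input:
```scala
object NewlineBytesDemo {
  def main(args: Array[String]): Unit = {
    // Print the byte sequence a newline encodes to in each charset.
    def hex(bytes: Array[Byte]): String = bytes.map(b => f"0x$b%02X").mkString(" ")
    println(hex("\n".getBytes("UTF-8")))     // 0x0A
    println(hex("\n".getBytes("UTF-16LE")))  // 0x0A 0x00
    println(hex("\n".getBytes("UTF-16BE")))  // 0x00 0x0A
    // A byte-oriented reader that splits on the lone 0x0A leaves a stray 0x00
    // on one side of the split, so the surrounding records decode to garbage,
    // the JSON parser emits null rows, and the test's mode filters them out.
  }
}
```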
We declare that we are able to read JSON. According to RFC 7159 (section 8.1, Character Encoding):
```
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
interoperable in the sense that they will be read successfully by the
maximum number of implementations; there are many implementations
that cannot successfully read texts in other encodings (such as
UTF-16 and UTF-32).
```
Users may expect that Spark can read JSON in charsets other than UTF-8, since per the RFC it SHALL be able to do so, and we DON'T anywhere declare that JSON in such encodings cannot be read successfully.
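For context, here is roughly how reading such a file could look (a minimal sketch; the `encoding` and `multiLine` option names are assumptions based on what this PR proposes, and the literal path stands in for the resource resolved by `testFile`):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

object Utf16JsonRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("utf16-json").getOrCreate()
    val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
    val jsonDF = spark.read
      .schema(schema)
      // multiLine avoids the byte-oriented 0x0A record splitting shown above;
      // "encoding" (assumed option name) tells the parser how to decode the file.
      .option("multiLine", true)
      .option("encoding", "UTF-16")
      .json("json-tests/utf16WithBOM.json")
    jsonDF.show()
    spark.stop()
  }
}
```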