Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20849#discussion_r175282421
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2063,4 +2063,178 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
)
}
}
+
+  def testFile(fileName: String): String = {
+    Thread.currentThread().getContextClassLoader.getResource(fileName).toString
+  }
+
+  test("json in UTF-16 with BOM") {
+    val fileName = "json-tests/utf16WithBOM.json"
+    val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
+    val jsonDF = spark.read.schema(schema)
+      // The mode filters out null rows produced because the newline delimiter
+      // for UTF-8 is used by default.
--- End diff ---
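To make the in-diff comment about the delimiter concrete: a tiny sketch (my own illustration, not code from this PR) of how the newline byte sequence differs per encoding, which is why splitting records on the single UTF-8 byte 0x0A mangles UTF-16 input:
```scala
object NewlineBytesDemo {
  def main(args: Array[String]): Unit = {
    // Print the byte sequence a newline encodes to in each charset.
    def hex(bytes: Array[Byte]): String = bytes.map(b => f"0x$b%02X").mkString(" ")
    println(hex("\n".getBytes("UTF-8")))     // 0x0A
    println(hex("\n".getBytes("UTF-16LE")))  // 0x0A 0x00
    println(hex("\n".getBytes("UTF-16BE")))  // 0x00 0x0A
    // A byte-oriented reader that splits on the lone 0x0A leaves a stray 0x00
    // on one side of the split, so the surrounding records decode to garbage,
    // the JSON parser emits null rows, and the test's mode filters them out.
  }
}
```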
We declare that we are able to read JSON. According to RFC 7159 (section 8.1, Character Encoding):
```
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
interoperable in the sense that they will be read successfully by the
maximum number of implementations; there are many implementations
that cannot successfully read texts in other encodings (such as
UTF-16 and UTF-32).
```
Users may expect that Spark can read JSON in charsets other than UTF-8, since per the RFC it SHALL be able to do so, and we DON'T anywhere declare that JSON in such encodings cannot be read successfully.
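For context, here is roughly how reading such a file could look (a minimal sketch; the `encoding` and `multiLine` option names are assumptions based on what this PR proposes, and the literal path stands in for the resource resolved by `testFile`):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

object Utf16JsonRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("utf16-json").getOrCreate()
    val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
    val jsonDF = spark.read
      .schema(schema)
      // multiLine avoids the byte-oriented 0x0A record splitting shown above;
      // "encoding" (assumed option name) tells the parser how to decode the file.
      .option("multiLine", true)
      .option("encoding", "UTF-16")
      .json("json-tests/utf16WithBOM.json")
    jsonDF.show()
    spark.stop()
  }
}
```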