Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21909#discussion_r210704902
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
---
@@ -2223,21 +2223,31 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
checkAnswer(jsonDF, Seq(Row("Chris", "Baird")))
}
-
test("SPARK-23723: specified encoding is not matched to actual encoding") {
- val fileName = "test-data/utf16LE.json"
- val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
- val exception = intercept[SparkException] {
- spark.read.schema(schema)
- .option("mode", "FAILFAST")
- .option("multiline", "true")
- .options(Map("encoding" -> "UTF-16BE"))
- .json(testFile(fileName))
- .count()
+ def doCount(bypassParser: Boolean, multiLine: Boolean): Long = {
+ var result: Long = -1
+ withSQLConf(SQLConf.BYPASS_PARSER_FOR_EMPTY_SCHEMA.key -> bypassParser.toString) {
+ val fileName = "test-data/utf16LE.json"
+ val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
+ result = spark.read.schema(schema)
+ .option("mode", "FAILFAST")
--- End diff --
> Does the mode matter?
I just want the test to fail with an explicit error, since an exception is
the expected result. The other modes would hide it: `DROPMALFORMED` makes
`count()` return `0`, and `PERMISSIVE` yields either a full table of nulls or
an exception.
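To make the comparison concrete, here is a minimal standalone sketch (not part of the PR) that reads the same file in each of the three modes with `multiLine` enabled and prints whatever `count()` produces. The object name `ParseModeCount`, the local `SparkSession`, and the default path are assumptions for illustration; the file is assumed to be UTF-16LE encoded, as in the test.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

import scala.util.Try

// Hypothetical driver, for illustration only.
object ParseModeCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parse-mode-count")
      .getOrCreate()

    // Assumption: a UTF-16LE encoded JSON file; pass another path as the first argument.
    val path = args.headOption.getOrElse("test-data/utf16LE.json")

    val schema = new StructType()
      .add("firstName", StringType)
      .add("lastName", StringType)

    for (mode <- Seq("PERMISSIVE", "DROPMALFORMED", "FAILFAST")) {
      // Deliberately declare the wrong encoding (UTF-16BE) so records fail to parse.
      // Try captures any exception so that all three modes can be compared in one run.
      val result = Try {
        spark.read.schema(schema)
          .option("mode", mode)
          .option("multiLine", "true")
          .option("encoding", "UTF-16BE")
          .json(path)
          .count()
      }
      println(s"mode=$mode -> $result")
    }

    spark.stop()
  }
}
```

Only `FAILFAST` is expected to surface the encoding mismatch as a failure regardless of the other settings, which is why the test pins that mode.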
> What happened if users use DROPMALFORMED before this PR?
It depends on `multiLine`. If it is `true`, the behaviour before and after the
PR is the same, since the optimization doesn't affect the `multiLine` mode. If
`multiLine` is `false`, then in the `DROPMALFORMED` mode the result is `5` (the
total number of lines) after the PR and `0` before it.
We could enable this optimization only for the `PERMISSIVE` mode to rule out
any deviation in output.