[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

MaxGekk Thu, 16 Aug 2018 11:53:25 -0700

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21909#discussion_r210704902
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
    @@ -2223,21 +2223,31 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
         checkAnswer(jsonDF, Seq(Row("Chris", "Baird")))
       }
     
    -
       test("SPARK-23723: specified encoding is not matched to actual encoding") {
    -    val fileName = "test-data/utf16LE.json"
    -    val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
    -    val exception = intercept[SparkException] {
    -      spark.read.schema(schema)
    -        .option("mode", "FAILFAST")
    -        .option("multiline", "true")
    -        .options(Map("encoding" -> "UTF-16BE"))
    -        .json(testFile(fileName))
    -        .count()
    +    def doCount(bypassParser: Boolean, multiLine: Boolean): Long = {
    +      var result: Long = -1
    +      withSQLConf(SQLConf.BYPASS_PARSER_FOR_EMPTY_SCHEMA.key -> bypassParser.toString) {
    +        val fileName = "test-data/utf16LE.json"
    +        val schema = new StructType().add("firstName", StringType).add("lastName", StringType)
    +        result = spark.read.schema(schema)
    +          .option("mode", "FAILFAST")
    --- End diff --
    
    > Does the mode matter?
    
    I just want the test to fail with an explicit error rather than return `0` from `count()` (`DROPMALFORMED`), or produce a table full of nulls or an exception (`PERMISSIVE`), since an exception is the expected result.
    
    > What happened if users use DROPMALFORMED before this PR?
    
    It depends on `multiLine`. If it is `true`, the behaviour before and after the PR is the same, since the optimization doesn't affect `multiLine` mode. If `multiLine` is `false`, then in `DROPMALFORMED` mode the result was `0` before the PR and is `5` (the total number of lines) after it.
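    
    To make the difference concrete, here is a toy model in plain Scala. It is only a sketch, not Spark code: `CountModel`, `parses`, and `count` are hypothetical names, and the "parser" is a trivial stand-in. It assumes that bypassing the parser means every input line is counted as one row.
    
    ```scala
    // Toy model only: not Spark code. All names here are hypothetical.
    object CountModel {
      sealed trait Mode
      case object Permissive extends Mode
      case object DropMalformed extends Mode
      case object FailFast extends Mode
    
      // Stand-in for a JSON parser: a line "parses" only if it looks like an object.
      def parses(line: String): Boolean = {
        val t = line.trim
        t.startsWith("{") && t.endsWith("}")
      }
    
      def count(lines: Seq[String], mode: Mode, bypassParser: Boolean): Long =
        if (bypassParser) {
          // Assumed optimization: nothing is parsed, each line counts as a row.
          lines.size.toLong
        } else mode match {
          // Malformed lines are kept (as null rows in this simplified model).
          case Permissive => lines.size.toLong
          // Malformed lines are silently dropped.
          case DropMalformed => lines.count(parses).toLong
          // The first malformed line aborts the count.
          case FailFast =>
            lines.find(l => !parses(l)).foreach { l =>
              throw new RuntimeException(s"Malformed record: $l")
            }
            lines.size.toLong
        }
    }
    ```
    
    With five lines that all fail to parse, `CountModel.count(lines, CountModel.DropMalformed, bypassParser = false)` gives `0`, while the same call with `bypassParser = true` gives `5`. In this model `FailFast` throws only when the parser actually runs.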
    
    We could enable this optimization only for `PERMISSIVE` mode, to rule out any deviation in the output.


