[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

MaxGekk Mon, 30 Jul 2018 02:00:58 -0700

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21909#discussion_r206059735
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
    @@ -450,7 +450,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
             input => rawParser.parse(input, createParser, UTF8String.fromString),
             parsedOptions.parseMode,
             schema,
    -        parsedOptions.columnNameOfCorruptRecord)
    +        parsedOptions.columnNameOfCorruptRecord,
    +        optimizeEmptySchema = true)
    --- End diff --
    
    > what would be the case to turn it off?
    
    We can apply the optimization when we know in advance that one JSON
    object corresponds to exactly one struct. In that case, we can return an
    empty row whenever the required schema (struct) is empty, without parsing
    the record. If `multiLine` is enabled, a single JSON document can contain
    many structs, so we cannot tell in advance how many empty rows to return
    without parsing it.
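
A minimal sketch of the reasoning above, as a toy model rather than Spark's actual code. Everything here (`EmptySchemaSketch`, `parseStructs`, `emptyRows`) is hypothetical and only borrows the flag name `optimizeEmptySchema` from the diff. The stand-in "parser" just counts top-level `{` characters and ignores braces inside strings.

```scala
// Toy model only: none of these names exist in Spark.
object EmptySchemaSketch {
  // A row is just its field values; an empty row has none.
  type Row = Seq[Any]

  // Stand-in for a real JSON parser. It returns one empty row per
  // top-level struct by tracking brace nesting depth. It does not
  // handle braces inside string values (an assumed simplification).
  def parseStructs(document: String): Seq[Row] = {
    var depth = 0
    var count = 0
    document.foreach {
      case '{' =>
        if (depth == 0) count += 1
        depth += 1
      case '}' =>
        depth -= 1
      case _ =>
    }
    Seq.fill(count)(Seq.empty[Any])
  }

  // Rows produced for one input when the required schema is empty.
  // With the flag on, we assume "one input == one struct" and skip parsing.
  // With the flag off, we parse to find out how many structs there are.
  def emptyRows(input: String, optimizeEmptySchema: Boolean): Seq[Row] =
    if (optimizeEmptySchema) Seq(Seq.empty[Any])
    else parseStructs(input)
}
```

For a single-object input such as `{"a": 1}`, both settings of the flag yield one empty row. For a `multiLine`-style document such as `[{"a": 1}, {"a": 2}]`, the shortcut yields one row while parsing yields two, which is why the shortcut is only safe when each input is known to hold exactly one struct.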


