Bogdan Raducanu created SPARK-25209:
---------------------------------------

             Summary: Optimization in Dataset.apply for DataFrames
                 Key: SPARK-25209
                 URL: https://issues.apache.org/jira/browse/SPARK-25209
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: Bogdan Raducanu


{{Dataset.apply}} calls {{dataset.deserializer}} (to provide an early error), which ends up running the full {{Analyzer}} on the deserializer. This can take tens of milliseconds, depending on how big the plan is.
Since {{Dataset.apply}} is called by many {{Dataset}} operations such as {{Dataset.where}}, it can be a significant overhead for short queries.
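For context, a simplified sketch of what the eager check in {{Dataset.apply}} roughly looks like (approximate, not the exact Spark 2.3 source; shown only to illustrate where the Analyzer ends up being invoked):
{code}
// Approximate sketch of the eager check in object Dataset (simplified).
object Dataset {
  def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
    val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
    // Touching dataset.deserializer eagerly resolves and binds the encoder's
    // deserializer against the plan's output, which runs the full Analyzer on a
    // small dummy plan; this is the per-call overhead described above.
    dataset.deserializer
    dataset
  }
}
{code}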

In the following code, the printed {{duration}} is *17 ms* with current Spark *vs 1 ms* if the {{dataset.deserializer}} line is removed.
The resulting {{deserializer}} is particularly large for a nested schema, but the same overhead can be observed with a very wide flat schema.
According to a comment in the PR that introduced this check, we can at least remove it for {{DataFrames}}:
[https://github.com/apache/spark/pull/20402#discussion_r164338267]
{code}
    import org.apache.spark.sql.functions.lit

    // Build a DataFrame with a single struct column containing 100 fields.
    val col = "named_struct(" +
      (0 until 100).map { i => s"'col$i', id" }.mkString(",") + ")"
    val df = spark.range(10).selectExpr(col)
    val TRUE = lit(true)

    // Measure the average time to construct (not execute) the filtered Dataset.
    val numIter = 1000
    val startTime = System.nanoTime()
    for (i <- 0 until numIter) {
      df.where(TRUE)
    }
    val durationMs = (System.nanoTime() - startTime) / numIter / 1000000
    println(s"duration $durationMs")
{code}
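
As suggested in the linked PR comment, one possible direction is to skip the eager check when the encoder is the generic {{Row}} encoder, i.e. for {{DataFrames}}. A rough sketch only (assumption: the early-error check only adds value for user-specified encoders):
{code}
// Sketch only: skip the eager deserializer check for DataFrames.
def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
  val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
  if (dataset.exprEnc.clsTag.runtimeClass != classOf[Row]) {
    // Only run the early compatibility check (and the Analyzer behind it) for
    // typed Datasets; for DataFrames the encoder is derived from the schema itself.
    dataset.deserializer
  }
  dataset
}
{code}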


