Bogdan Raducanu created SPARK-25209:
---------------------------------------
Summary: Optimization in Dataset.apply for DataFrames
Key: SPARK-25209
URL: https://issues.apache.org/jira/browse/SPARK-25209
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.1
Reporter: Bogdan Raducanu
{{Dataset.apply}} calls {{dataset.deserializer}} (to provide an early error),
which ends up running the full {{Analyzer}} on the deserializer. This can take
tens of milliseconds, depending on the size of the plan.
Since {{Dataset.apply}} is called for many {{Dataset}} operations, such as
{{Dataset.where}}, it can add significant overhead to short queries.
In the following code, {{durationMs}} is *17 ms* in current Spark vs. *1 ms*
if the {{dataset.deserializer}} line is removed.
It seems the resulting {{deserializer}} is particularly big in the case of a
nested schema, but the same overhead can be observed with a very wide
flat schema.
According to a comment in the PR that introduced this check, it can at least
be removed for {{DataFrames}}:
[https://github.com/apache/spark/pull/20402#discussion_r164338267]
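The guard suggested in that comment could be sketched in simplified, standalone form. These are stub classes, not Spark's real {{Dataset}} internals; {{eagerCheck}}, {{analyzerRuns}}, and {{isDataFrame}} are hypothetical names used only to illustrate the idea:

```scala
// Standalone sketch of the proposed guard (NOT Spark's real code):
// run the expensive eager deserializer resolution only for typed Datasets,
// and skip it for DataFrames, where the deserializer check cannot fail.
object EagerCheckSketch {
  var analyzerRuns = 0 // counts how often the expensive analysis fired

  // stand-in for Dataset[T]; isDataFrame models "the encoder is a RowEncoder"
  class Dataset[T](val isDataFrame: Boolean) {
    // stand-in for running the full Analyzer on the deserializer
    private def resolveDeserializer(): Unit = analyzerRuns += 1

    // the early-error check from Dataset.apply, guarded for DataFrames
    def eagerCheck(): Dataset[T] = {
      if (!isDataFrame) resolveDeserializer()
      this
    }
  }
}
```

With a guard like this, operations such as {{where}} on a plain {{DataFrame}} would no longer pay the analysis cost on every call, while typed {{Datasets}} would keep the early error.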
{code}
// build a DataFrame with one deeply wide nested column: 100 struct fields
val col = "named_struct(" +
  (0 until 100).map { i => s"'col$i', id" }.mkString(",") + ")"
val df = spark.range(10).selectExpr(col)
val TRUE = lit(true)
val numIter = 1000
val startTime = System.nanoTime()
for (i <- 0 until numIter) {
  df.where(TRUE) // only builds the plan; the cost measured is analysis, not execution
}
val durationMs = (System.nanoTime() - startTime) / numIter / 1000000
println(s"duration $durationMs")
{code}