[
https://issues.apache.org/jira/browse/SPARK-25209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bogdan Raducanu updated SPARK-25209:
------------------------------------
Issue Type: Improvement (was: Bug)
> Optimization in Dataset.apply for DataFrames
> --------------------------------------------
>
> Key: SPARK-25209
> URL: https://issues.apache.org/jira/browse/SPARK-25209
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Bogdan Raducanu
> Priority: Major
>
> {{Dataset.apply}} calls {{dataset.deserializer}} (to provide an early error),
> which ends up running the full {{Analyzer}} on the deserializer. This can
> take tens of milliseconds, depending on how big the plan is.
> Since {{Dataset.apply}} is called by many {{Dataset}} operations, such as
> {{Dataset.where}}, this can add significant overhead to short queries.
> In the code below, {{duration}} is *17 ms* in current Spark vs. *1 ms*
> if the {{dataset.deserializer}} line is removed.
> The resulting {{deserializer}} seems to be particularly big for a nested
> schema, but the same overhead can be observed with a very wide flat schema.
> According to a comment in the PR that introduced this check, we can at least
> remove this check for {{DataFrames}}:
> [https://github.com/apache/spark/pull/20402#discussion_r164338267]
> {code}
> import org.apache.spark.sql.functions.lit
>
> // Build a wide nested struct so the deserializer plan is large.
> val col = "named_struct(" +
>   (0 until 100).map { i => s"'col$i', id" }.mkString(",") + ")"
> val df = spark.range(10).selectExpr(col)
> val TRUE = lit(true)
> val numIter = 1000
> val startTime = System.nanoTime()
> for (_ <- 0 until numIter) {
>   df.where(TRUE)
> }
> // Average time per Dataset.where call, in milliseconds.
> val durationMs = (System.nanoTime() - startTime) / numIter / 1000000
> println(s"duration $durationMs")
> {code}
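> A rough sketch of the shape such a change could take (based on the PR comment
> linked above; not an actual patch, and it assumes the encoder's {{clsTag}} is
> reachable via {{exprEnc}} inside {{Dataset.apply}} to detect the {{Row}} /
> DataFrame case):
> {code}
> // Sketch only: skip the eager deserializer resolution when the encoder is for
> // Row (i.e. the Dataset is a DataFrame), where the check is trivially satisfied.
> def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
>   val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
>   // Eagerly bind the encoder so resolution errors surface early, but only for
>   // typed Datasets: for Dataset[Row] this resolves the deserializer through the
>   // full Analyzer, which is the overhead measured above.
>   if (dataset.exprEnc.clsTag.runtimeClass != classOf[Row]) {
>     dataset.deserializer
>   }
>   dataset
> }
> {code}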