[ https://issues.apache.org/jira/browse/SPARK-16360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-16360:
----------------------------------
Description:
Currently, there are a few reports of a Spark 2.0 query performance regression
for large queries.
This issue speeds up SQL query processing by removing the redundant consecutive
`executePlan` call made in the `Dataset.ofRows` function and in `Dataset`
instantiation. Specifically, it aims to reduce the overhead of generating the
SQL query execution plan, not of executing the query itself. Because of that,
the improvement is not visible in the Spark Web UI. Please use the following
query script to measure it.
**Before**
{code}
scala> :pa
// Entering paste mode (ctrl-D to finish)
val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
  s"""
    |SELECT $columns
    |FROM VALUES ($values) T($columns)
    |WHERE 1=2 AND 1 IN ($columns)
    |GROUP BY $columns
    |ORDER BY $columns
    |""".stripMargin
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s")
  result
}
time(sql(query))
time(sql(query))
// Exiting paste mode, now interpreting.
Elapsed time: 30.138142577s
Elapsed time: 25.787751452s
{code}
**After**
{code}
Elapsed time: 17.500279659s // First query has a little overhead of initialization.
Elapsed time: 12.364812255s // This shows the real difference. The speed up is about 2 times.
{code}
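The redundancy described above can be illustrated with a toy model. This is only a sketch, not Spark's actual implementation: `LogicalPlan`, `QueryExecution`, `Dataset`, `ofRowsBefore`, `ofRowsAfter`, and `countRuns` are hypothetical stand-ins defined here, and the only assumption taken from the description is that each `executePlan` call produces a fresh object whose analysis has to be computed again.

```scala
// Toy model only: every class below is a hypothetical stand-in, not Spark's API.
object RedundantPlanSketch {
  // Counts how many times the (expensive) analysis step runs.
  var analysisRuns = 0

  final case class LogicalPlan(sql: String)

  // Analysis is lazy and cached, but only within one QueryExecution instance.
  final class QueryExecution(val logical: LogicalPlan) {
    lazy val analyzed: LogicalPlan = {
      analysisRuns += 1
      logical
    }
  }

  // Stand-in for `executePlan`: always builds a fresh QueryExecution.
  def executePlan(plan: LogicalPlan): QueryExecution = new QueryExecution(plan)

  // Resolving the schema forces analysis on the QueryExecution it was given.
  final class Dataset(val queryExecution: QueryExecution) {
    val schema: String = queryExecution.analyzed.sql
  }

  // "Before": analyze once to validate, then call executePlan a second time,
  // so the new Dataset has to analyze the same plan again.
  def ofRowsBefore(plan: LogicalPlan): Dataset = {
    val qe = executePlan(plan)
    require(qe.analyzed != null) // validation pass
    new Dataset(executePlan(plan))
  }

  // "After": hand the already-analyzed QueryExecution to the Dataset.
  def ofRowsAfter(plan: LogicalPlan): Dataset = {
    val qe = executePlan(plan)
    require(qe.analyzed != null) // validation pass
    new Dataset(qe)
  }

  // Resets the counter, builds one Dataset, and reports the analysis count.
  def countRuns(build: LogicalPlan => Dataset): Int = {
    analysisRuns = 0
    build(LogicalPlan("SELECT 1"))
    analysisRuns
  }

  def main(args: Array[String]): Unit = {
    println(s"before: ${countRuns(ofRowsBefore)} analysis runs") // before: 2 analysis runs
    println(s"after: ${countRuns(ofRowsAfter)} analysis runs")   // after: 1 analysis runs
  }
}
```

In this model the "before" path does the plan-generation work twice per query and the "after" path does it once, which is at least consistent with the roughly 2x difference in the timings above.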
was:
Currently, there are a few reports of a Spark 2.0 query performance regression
for large queries.
This issue speeds up SQL query processing by removing the redundant consecutive
analysis in the `Dataset.ofRows` function and in `Dataset` instantiation.
Specifically, it aims to reduce the overhead of SQL query analysis, not of
query execution. Because of that, the improvement is not visible in the Spark
Web UI. Please use the following query script to measure it.
**Before**
{code}
scala> :pa
// Entering paste mode (ctrl-D to finish)
val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
  s"""
    |SELECT $columns
    |FROM VALUES ($values) T($columns)
    |WHERE 1=2 AND 1 IN ($columns)
    |GROUP BY $columns
    |ORDER BY $columns
    |""".stripMargin
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s")
  result
}
time(sql(query))
time(sql(query))
// Exiting paste mode, now interpreting.
Elapsed time: 30.138142577s
Elapsed time: 25.787751452s
{code}
**After**
{code}
Elapsed time: 17.500279659s // First query has a little overhead of initialization.
Elapsed time: 12.364812255s // This shows the real difference. The speed up is about 2 times.
{code}
Summary: Speed up SQL query performance by removing redundant
`executePlan` call in `Dataset` (was: Speed up SQL query performance by
removing redundant analysis in `Dataset`)
> Speed up SQL query performance by removing redundant `executePlan` call in
> `Dataset`
> ------------------------------------------------------------------------------------
>
> Key: SPARK-16360
> URL: https://issues.apache.org/jira/browse/SPARK-16360
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Dongjoon Hyun