[ 
https://issues.apache.org/jira/browse/SPARK-16360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16360:
----------------------------------
    Description: 
Currently, there are a few reports of Spark 2.0 query performance regressions 
for large queries.

This issue speeds up SQL query processing by removing a redundant consecutive 
`executePlan` call in the `Dataset.ofRows` function and `Dataset` 
instantiation. Specifically, this issue aims to reduce the overhead of SQL 
query execution plan generation, not of real query execution, so the 
improvement is not visible in the Spark Web UI. Please use the following query 
script to measure it.

**Before**
{code}
scala> :pa
// Entering paste mode (ctrl-D to finish)

val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
  s"""
     |SELECT $columns
     |FROM VALUES ($values) T($columns)
     |WHERE 1=2 AND 1 IN ($columns)
     |GROUP BY $columns
     |ORDER BY $columns
     |""".stripMargin

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s")
  result
}

time(sql(query))
time(sql(query))

// Exiting paste mode, now interpreting.

Elapsed time: 30.138142577s
Elapsed time: 25.787751452s
{code}

**After**
{code}
Elapsed time: 17.500279659s  // First query has a little overhead of initialization.
Elapsed time: 12.364812255s  // This shows the real difference. The speed-up is about 2 times.
{code}
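The shape of the fix can be illustrated with a self-contained sketch. This is hypothetical code, not Spark's actual `Dataset` source: the names mirror `Dataset.ofRows` and `QueryExecution`, and a counter stands in for the expensive plan-generation work. Before the patch, `ofRows` built a `QueryExecution` and the `Dataset` constructor then built a second one for the same plan; after, the `QueryExecution` is built once and passed through.

{code}
// Hypothetical sketch of the pattern behind the fix (not Spark source code).
object Sketch {
  var planGenerations = 0 // stands in for analyzer/planner work

  class QueryExecution(val plan: String) {
    planGenerations += 1 // expensive plan generation happens at construction
  }

  // Before: two QueryExecution instances are created for one query.
  class DatasetBefore(plan: String) {
    val qe = new QueryExecution(plan) // second, redundant generation
  }
  def ofRowsBefore(plan: String): DatasetBefore = {
    new QueryExecution(plan) // first generation, then discarded
    new DatasetBefore(plan)
  }

  // After: the QueryExecution is created once and reused.
  class DatasetAfter(val qe: QueryExecution)
  def ofRowsAfter(plan: String): DatasetAfter =
    new DatasetAfter(new QueryExecution(plan))
}
{code}

Under this sketch, `ofRowsBefore` pays for plan generation twice per query while `ofRowsAfter` pays once, which is consistent with the roughly 2x speed-up measured above.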



  was:
Currently, there are a few reports of Spark 2.0 query performance regressions 
for large queries.

This issue speeds up SQL query processing by removing redundant consecutive 
analysis in the `Dataset.ofRows` function and `Dataset` instantiation. 
Specifically, this issue aims to reduce the overhead of SQL query analysis, not 
of query execution, so the improvement is not visible in the Spark Web UI. 
Please use the following query script to measure it.

**Before**
{code}
scala> :pa
// Entering paste mode (ctrl-D to finish)

val n = 4000
val values = (1 to n).map(_.toString).mkString(", ")
val columns = (1 to n).map("column" + _).mkString(", ")
val query =
  s"""
     |SELECT $columns
     |FROM VALUES ($values) T($columns)
     |WHERE 1=2 AND 1 IN ($columns)
     |GROUP BY $columns
     |ORDER BY $columns
     |""".stripMargin

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s")
  result
}

time(sql(query))
time(sql(query))

// Exiting paste mode, now interpreting.

Elapsed time: 30.138142577s
Elapsed time: 25.787751452s
{code}

**After**
{code}
Elapsed time: 17.500279659s  // First query has a little overhead of initialization.
Elapsed time: 12.364812255s  // This shows the real difference. The speed-up is about 2 times.
{code}



        Summary: Speed up SQL query performance by removing redundant 
`executePlan` call in `Dataset`  (was: Speed up SQL query performance by 
removing redundant analysis in `Dataset`)

> Speed up SQL query performance by removing redundant `executePlan` call in 
> `Dataset`
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-16360
>                 URL: https://issues.apache.org/jira/browse/SPARK-16360
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Dongjoon Hyun
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
