[
https://issues.apache.org/jira/browse/SPARK-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated SPARK-16186:
------------------------------
Assignee: Dongjoon Hyun
> Support partition batch pruning with `IN` predicate in InMemoryTableScanExec
> ----------------------------------------------------------------------------
>
> Key: SPARK-16186
> URL: https://issues.apache.org/jira/browse/SPARK-16186
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> One of the most frequent usage patterns for Spark SQL is querying *cached tables*.
> This issue improves `InMemoryTableScanExec` to handle `IN` predicates efficiently by pruning partition batches.
> Of course, the performance improvement varies across queries and datasets. For the following simple query, the query duration shown in the Spark UI drops from 9 seconds to 50~90ms, i.e. over 100 times faster.
> {code}
> $ bin/spark-shell --driver-memory 6G
> scala> val df = spark.range(2000000000)
> scala> df.createOrReplaceTempView("t")
> scala> spark.catalog.cacheTable("t")
> scala> sql("select id from t where id = 1").collect() // About 2 mins
> scala> sql("select id from t where id = 1").collect() // less than 90ms
> scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds (Before)
> scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms (After)
> {code}
> This improvement affects over 35 TPC-DS queries when the tables are cached.
> Note that this optimization applies to the `In` predicate. Since `IN` lists with more than 10 items are converted to `InSet`, the *spark.sql.optimizer.inSetConversionThreshold* option should be increased to benefit from pruning in those cases.
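> As a minimal spark-shell sketch (reusing the cached table `t` from the example above; the threshold value 100 is illustrative), the conversion threshold can be raised so that larger `IN` lists stay on the `In` predicate path instead of being converted to `InSet`:
> {code}
> scala> // Raise the In-to-InSet conversion threshold (default: 10)
> scala> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", 100)
> scala> // An IN list with more than 10 items can now be batch-pruned as well
> scala> sql("select id from t where id in (1,2,3,4,5,6,7,8,9,10,11,12)").collect()
> {code}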
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]