Dongjoon Hyun created SPARK-16186:
-------------------------------------

             Summary: Support partition batch pruning with `IN` predicate in 
InMemoryTableScanExec
                 Key: SPARK-16186
                 URL: https://issues.apache.org/jira/browse/SPARK-16186
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Dongjoon Hyun


One of the most frequent usage patterns for Spark SQL is using **cached 
tables**.

This issue improves `InMemoryTableScanExec` to handle `IN` predicate 
efficiently by pruning partition batches.

Of course, the performance improvement varies over the queries and the 
datasets. For the following simple query, the query duration in Spark UI goes 
from 9 seconds to 50~90ms. It's about 100 times faster.
{code}
$ bin/spark-shell --driver-memory 6G
scala> val df = spark.range(2000000000)
scala> df.createOrReplaceTempView("t")
scala> spark.catalog.cacheTable("t")
scala> sql("select id from t where id = 1").collect()    // About 2 mins
scala> sql("select id from t where id = 1").collect()    // less than 90ms
scala> sql("select id from t where id in (1,2,3)").collect()  // 9 seconds
scala> 
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
10)  // Enable. (Just to show this examples, currently the default value is 10.)
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
spark.conf.set("spark.sql.inMemoryColumnarStorage.partitionPruningMaxInSize", 
0)  // Disable
scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms
{code}

This issue has impacts over 35 queries of TPC-DS if the tables are cached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to