[ 
https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-27227:
-----------------------------
    Summary: Spark Runtime Filter  (was: Dynamic Partition Prune in Spark)

> Spark Runtime Filter
> --------------------
>
>                 Key: SPARK-27227
>                 URL: https://issues.apache.org/jira/browse/SPARK-27227
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Song Jun
>            Priority: Major
>
> When we equi-join one big table with a smaller table, we can collect some 
> statistics from the smaller table side, and use it to the scan of big table 
> to do partition prune or data filter before execute the join.
> This can significantly improve SQL perfermance.
> For a simple example:
> select * from A, B where A.a = B.b
> A is big table ,B is small table.
> There are two scenarios:
> 1. A.a is a partition column of table A
>    we can collect  all the values  of B.b, and send it to table A to do 
>    partition prune on A.a.
> 2. A.a is not a partition column of table A
>   we can collect real-time some statistics(such as min/max/bloomfilter) of 
> B.b by execute extra sql(select max(b),min(b),bbf(b) from B), and send it to 
> table A to do filter on A.a.
>   Addititionaly, if a more complex query select * from A join (select * from 
> B where B.c = 1) X on A.a = B.b, then we collect real-time statistics(such as 
> min/max/bloomfilter) of X by execute extra sql(select max(b),min(b),bbf(b) 
> from X)
> Above two scenarios, we can filter out lots of data by partition prune or 
> data filter, thus we can imporve perfermance.
> 10TB TPC-DS  gain about 35%  improvement in our test.
> I will submit a SPIP later.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to