[jira] [Comment Edited] (HIVE-15269) Dynamic Min-Max/BloomFilter runtime-filtering for Tez

Deepak Jaiswal (JIRA) Sun, 28 Jan 2018 20:02:25 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342874#comment-16342874
 ]


Deepak Jaiswal edited comment on HIVE-15269 at 1/29/18 4:01 AM:
----------------------------------------------------------------

Yes, it is trying to remove the DPP or runtime filter branch. The idea is to
break the cycle by removing the branch with the least impact. The AM operator
or the TS operator is on the target table, not on the side where the branch
is, so if the target is smaller, the impact of removing the branch is likely
smaller.

 

As a simple case for runtime filtering, let's say there are two tables, A and
B. A has 1 GB of data and B has 10 GB of data. There is a branch on A which
creates a filter for B, and there is a branch on B which creates a filter for
A. This results in a cycle, which needs to be broken by removing the branch
that is least effective. In this case, we want to remove the branch on B which
creates a filter on A, as A is 10 times smaller. That is what this code does:
it compares the sizes of the target tables and picks the table with the
smallest size as the candidate.
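
To make the A/B example concrete, here is a minimal sketch of the selection
rule only. It is not the actual Hive code: the class CycleBreakSketch, the
FilterBranch type and its fields are hypothetical names made up for
illustration, and the table sizes are assumed to be already known.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class CycleBreakSketch {
    /** Hypothetical stand-in for a DPP/runtime-filter branch; not a Hive class. */
    static final class FilterBranch {
        final String sourceTable;    // table the branch is built on
        final String targetTable;    // table whose scan receives the filter
        final long targetSizeBytes;  // assumed size estimate of the target table

        FilterBranch(String sourceTable, String targetTable, long targetSizeBytes) {
            this.sourceTable = sourceTable;
            this.targetTable = targetTable;
            this.targetSizeBytes = targetSizeBytes;
        }
    }

    /** Picks the branch whose target table is smallest as the removal candidate. */
    static FilterBranch pickBranchToRemove(List<FilterBranch> branchesInCycle) {
        return branchesInCycle.stream()
                .min(Comparator.comparingLong((FilterBranch b) -> b.targetSizeBytes))
                .orElseThrow(() -> new IllegalArgumentException("no branches in cycle"));
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        // Branch on A filters B (10 GB); branch on B filters A (1 GB).
        FilterBranch onA = new FilterBranch("A", "B", 10 * gb);
        FilterBranch onB = new FilterBranch("B", "A", 1 * gb);
        FilterBranch removed = pickBranchToRemove(Arrays.asList(onA, onB));
        System.out.println("remove branch on " + removed.sourceTable
                + " (target " + removed.targetTable + ")");
    }
}
```

With the sizes from the example, the branch built on B (whose target is A) is
the one chosen for removal, since A is the smaller target.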

 

I hope it helps.

 

Please take a look at test dynamic_semijoin_reduction.q and its result file for 
the explain plans.

 

[https://github.com/apache/hive/blob/1dd863ab0bc47115d3c89ed8058967c1496819c6/ql/src/test/queries/clientpositive/dynamic_semijoin_reduction.q]

 

[https://github.com/apache/hive/blob/1dd863ab0bc47115d3c89ed8058967c1496819c6/ql/src/test/results/clientpositive/llap/dynamic_semijoin_reduction.q.out]

 


> Dynamic Min-Max/BloomFilter runtime-filtering for Tez
> -----------------------------------------------------
>
>                 Key: HIVE-15269
>                 URL: https://issues.apache.org/jira/browse/HIVE-15269
>             Project: Hive
>          Issue Type: New Feature
>          Components: Tez
>            Reporter: Jason Dere
>            Assignee: Deepak Jaiswal
>            Priority: Major
>              Labels: TODOC2.2.0
>             Fix For: 2.2.0
>
>         Attachments: HIVE-15269.1.patch, HIVE-15269.10.patch, 
> HIVE-15269.11.patch, HIVE-15269.12.patch, HIVE-15269.13.patch, 
> HIVE-15269.14.patch, HIVE-15269.15.patch, HIVE-15269.16.patch, 
> HIVE-15269.17.patch, HIVE-15269.18.patch, HIVE-15269.19.patch, 
> HIVE-15269.2.patch, HIVE-15269.3.patch, HIVE-15269.4.patch, 
> HIVE-15269.5.patch, HIVE-15269.6.patch, HIVE-15269.7.patch, 
> HIVE-15269.8.patch, HIVE-15269.9.patch
>
>
> If a dimension table and fact table are joined:
> {noformat}
> select *
> from store join store_sales on (store.id = store_sales.store_id)
> where store.s_store_name = 'My Store'
> {noformat}
> One optimization that can be done is to get the min/max store id values that 
> come out of the scan/filter of the store table, and send this min/max value 
> (via Tez edge) to the task which is scanning the store_sales table.
> We can add a BETWEEN(min, max) predicate to the store_sales TableScan, where 
> this predicate can be pushed down to the storage handler (for example for ORC 
> formats). Pushing a min/max predicate to the ORC reader would allow us to 
> avoid having to read entire row groups during the table scan.
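
The min/max idea in the description above can be sketched roughly as follows.
This is an illustration only, not Hive or ORC code: MinMaxFilterSketch and its
methods are hypothetical names, join keys are assumed to be longs, and each
row group is assumed to expose min/max statistics for the join column.

```java
final class MinMaxFilterSketch {
    /** Returns {min, max} of the join keys seen on the small (dimension) side. */
    static long[] minMax(long[] keys) {
        if (keys.length == 0) {
            throw new IllegalArgumentException("no keys");
        }
        long min = keys[0];
        long max = keys[0];
        for (long k : keys) {
            min = Math.min(min, k);
            max = Math.max(max, k);
        }
        return new long[] {min, max};
    }

    /**
     * A row group can be skipped when its [groupMin, groupMax] statistics do not
     * overlap the BETWEEN(filterMin, filterMax) range sent from the other side.
     */
    static boolean rowGroupMayMatch(long groupMin, long groupMax,
                                    long filterMin, long filterMax) {
        return groupMax >= filterMin && groupMin <= filterMax;
    }

    public static void main(String[] args) {
        long[] range = minMax(new long[] {7, 3, 9}); // store ids surviving the filter
        // A row group covering store_id 10..20 cannot contain ids in 3..9.
        System.out.println(rowGroupMayMatch(10, 20, range[0], range[1]));
        // A row group covering store_id 5..12 overlaps 3..9 and must be read.
        System.out.println(rowGroupMayMatch(5, 12, range[0], range[1]));
    }
}
```

The range check can only rule row groups out; a group that overlaps the range
still has to be read and joined normally.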



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
