[ https://issues.apache.org/jira/browse/DRILL-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001400#comment-15001400 ]

Jinfeng Ni edited comment on DRILL-3765 at 11/13/15 10:46 PM:
--------------------------------------------------------------

Did some preliminary testing to see how much performance we may gain from the 
patch, which moves the PruneScanRules into a HepPlanner that runs once the 
project/filter pushdown has been applied. Here are the results when run on a Mac.

Data: TPC-DS sample dataset.
1. Create a partitioned table.  This produces a table with 18000 parquet files. 
{code}
create table dfs.tmp.store_pb_item_sk partition by (ss_item_sk) as select * 
from store_sale;
{code} 

2. Query the partitioned table with a filter referring to the partition column 
(ss_item_sk) and a non-partitioning column.
{code}
explain plan for select ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, 
ss_customer_sk from dfs.tmp.store_pb_item_sk where ss_item_sk in (100, 200, 
300, 400, 500) and ss_customer_sk = 96479;
{code}  

3. Results:
{code}
alter session set `planner.enable_hep_partition_pruning` = true;

explain plan for select ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, 
ss_customer_sk from dfs.tmp.store_pb_item_sk where ss_item_sk in (100, 200, 
300, 400, 500) and ss_customer_sk = 96479;

1 row selected (5.246 seconds)

alter session set `planner.enable_hep_partition_pruning` = false;
explain plan for select ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, 
ss_customer_sk from dfs.tmp.store_pb_item_sk where ss_item_sk in (100, 200, 
300, 400, 500) and ss_customer_sk = 96479;

+------+------+
1 row selected (9.412 seconds)
{code}

By avoiding the repeated PruneScanRule executions, the planning time is reduced 
from 9.4 seconds to 5.2 seconds. With more parquet files in the table, or with a 
multi-table join query, we would expect to see even bigger improvements from 
this patch.

With a parquet metadata cache file created, I saw similar numbers for the 
existing code and for the patched code.

The log shows that the existing code indeed fires the PruneScanRules multiple 
times, for both directory-based pruning and partitioning-column (from CTAS) 
based pruning. With the patch, partition pruning is fired once for 
directory-based pruning and once for partitioning-column pruning. That explains 
the performance gain we saw in this preliminary test.
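
For reference, the snippet below is a minimal Calcite-level sketch of the idea 
(not the actual Drill patch): the pruning rules are handed to a HepPlanner and 
applied once, as two sequential groups, over the plan produced after 
project/filter pushdown. The class name, method name, and the two rule 
collections are placeholders, not Drill APIs.
{code}
// Sketch only (not the actual Drill patch). It shows, at the Calcite level,
// what "moving the PruneScanRules into a HepPlanner" means: the pruning rules
// run once over the plan produced after project/filter pushdown, instead of
// being re-fired during the cost-based planning phase.
import java.util.Collection;

import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.hep.HepMatchOrder;
import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;

public final class HepPartitionPruningSketch {

  /**
   * Applies directory-based pruning rules and partitioning-column (CTAS)
   * pruning rules as two sequential groups, each driven to a fixpoint once.
   * Both rule collections are placeholders for whatever PruneScanRules the
   * patch registers.
   */
  public static RelNode pruneOnce(RelNode planAfterPushdown,
                                  Collection<RelOptRule> directoryPruneRules,
                                  Collection<RelOptRule> partitionColumnPruneRules) {
    HepProgram program = new HepProgramBuilder()
        .addMatchOrder(HepMatchOrder.BOTTOM_UP)        // visit scans before the filters/projects above them
        .addRuleCollection(directoryPruneRules)        // e.g. pruning on dir0/dir1 style columns
        .addRuleCollection(partitionColumnPruneRules)  // e.g. pruning on CTAS "partition by" columns
        .build();

    HepPlanner hepPlanner = new HepPlanner(program);
    hepPlanner.setRoot(planAfterPushdown);
    return hepPlanner.findBestExp();                   // plan with pruned scans substituted in
  }
}
{code}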





> Partition prune rule is unnecessarily fired multiple times. 
> ----------------------------------------------------------
>
>                 Key: DRILL-3765
>                 URL: https://issues.apache.org/jira/browse/DRILL-3765
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> It seems that the partition prune rule may be fired multiple times, even 
> after the first rule execution has pushed the filter into the scan operator. 
> Since partition pruning has to build vectors to hold the partition / file / 
> directory information, invoking the partition prune rule unnecessarily may 
> lead to significant memory overhead.
> The Drill planner should avoid the unnecessary partition prune rule 
> executions, in order to reduce the chance of hitting an OOM exception while 
> the partition prune rule is executed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
