[ 
https://issues.apache.org/jira/browse/HIVE-26008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502864#comment-17502864
 ] 

Stamatis Zampetakis commented on HIVE-26008:
--------------------------------------------

I had a quick look on the full plan of the q83 and it seems that the {{Dynamic 
Partitioning Event Operator}} are all coming from the following {{TableScan}} 
operator.
{noformat}
Map 8 
  Map Operator Tree:
      TableScan
        alias: date_dim
        filterExpr: ((d_week_seq is not null and d_date is not null) or 
((d_date) IN (DATE'1998-01-02', DATE'1998-10-15', DATE'1998-11-10') and 
d_week_seq is not null) or d_date is not null) (type: boolean)
        Statistics: Num rows: 73049 Data size: 4382940 Basic stats: COMPLETE 
Column stats: COMPLETE
{noformat}
To be more precise I think the presence of the IN predicate in the disjunction 
causes the DPP optimization to kick in but given that one side of the 
disjunction (not null predicates) is low selective we end up sending the table 
everywhere. I haven't debugged this but I have a feeling that the problem may 
come from {{SyntheticJoinPredicate}} transformation.


> Dynamic partition pruning not sending right partitions with subqueries
> ----------------------------------------------------------------------
>
>                 Key: HIVE-26008
>                 URL: https://issues.apache.org/jira/browse/HIVE-26008
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance
>         Attachments: Screenshot 2022-03-08 at 5.04.02 AM.png, 
> problematic-plan.png
>
>
> DPP isn't working fine when there are subqueries involved. Here is an example 
> query (q83). 
> Note that "date_dim" has another query involved. Due to this, DPP operator 
> ends up sending entire "date_dim" to the fact tables. 
> Because of this, data scanned for fact tables are way higher and query 
> runtime is increased.
> For context, on a very small cluster, this query ran for 265 seconds and with 
> the rewritten query it finished in 11 seconds!. Fact table scan was 10MB vs 
> 10 GB.
> {noformat}
> HiveJoin(condition=[=($2, $5)], joinType=[inner])
>                 HiveJoin(condition=[=($0, $3)], joinType=[inner])
>                   HiveProject(cr_item_sk=[$1], cr_return_quantity=[$16], 
> cr_returned_date_sk=[$26])
>                     HiveFilter(condition=[AND(IS NOT NULL($26), IS NOT 
> NULL($1))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> catalog_returns]], table:alias=[catalog_returns])
>                   HiveProject(i_item_sk=[$0], i_item_id=[$1])
>                     HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT 
> NULL($0))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> item]], table:alias=[item])
>                 HiveProject(d_date_sk=[$0], d_date=[$2])
>                   HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT 
> NULL($0))])
>                     HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>               HiveProject(d_date=[$0])
>                 HiveSemiJoin(condition=[=($1, $2)], joinType=[semi])
>                   HiveProject(d_date=[$2], d_week_seq=[$4])
>                     HiveFilter(condition=[AND(IS NOT NULL($4), IS NOT 
> NULL($2))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>                   HiveProject(d_week_seq=[$4])
>                     HiveFilter(condition=[AND(IN($2, 1998-01-02:DATE, 
> 1998-10-15:DATE, 1998-11-10:DATE), IS NOT NULL($4))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
> {noformat}
> *Original Query & Plan: *
> {noformat}
> explain cbo with sr_items as
> (select i_item_id item_id,
> sum(sr_return_quantity) sr_item_qty
> from store_returns,
> item,
> date_dim
> where sr_item_sk = i_item_sk
> and   d_date    in
> (select d_date
> from date_dim
> where d_week_seq in
> (select d_week_seq
> from date_dim
> where d_date in ('1998-01-02','1998-10-15','1998-11-10')))
> and   sr_returned_date_sk   = d_date_sk
> group by i_item_id),
> cr_items as
> (select i_item_id item_id,
> sum(cr_return_quantity) cr_item_qty
> from catalog_returns,
> item,
> date_dim
> where cr_item_sk = i_item_sk
> and   d_date    in
> (select d_date
> from date_dim
> where d_week_seq in
> (select d_week_seq
> from date_dim
> where d_date in ('1998-01-02','1998-10-15','1998-11-10')))
> and   cr_returned_date_sk   = d_date_sk
> group by i_item_id),
> wr_items as
> (select i_item_id item_id,
> sum(wr_return_quantity) wr_item_qty
> from web_returns,
> item,
> date_dim
> where wr_item_sk = i_item_sk
> and   d_date    in
> (select d_date
> from date_dim
> where d_week_seq in
> (select d_week_seq
> from date_dim
> where d_date in ('1998-01-02','1998-10-15','1998-11-10')))
> and   wr_returned_date_sk   = d_date_sk
> group by i_item_id)
> select  sr_items.item_id
> ,sr_item_qty
> ,sr_item_qty/(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 * 100 sr_dev
> ,cr_item_qty
> ,cr_item_qty/(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 * 100 cr_dev
> ,wr_item_qty
> ,wr_item_qty/(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 * 100 wr_dev
> ,(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 average
> from sr_items
> ,cr_items
> ,wr_items
> where sr_items.item_id=cr_items.item_id
> and sr_items.item_id=wr_items.item_id
> order by sr_items.item_id
> ,sr_item_qty
> limit 100
> INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
> INFO  : Completed executing 
> command(queryId=hive_20220307055109_88ad0cbd-bd40-45bc-92ae-ab15fa6b1da4); 
> Time taken: 0.973 seconds
> INFO  : OK
> Explain
> CBO PLAN:
> HiveSortLimit(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC], fetch=[100])
>   HiveProject(item_id=[$0], sr_item_qty=[$4], sr_dev=[*(/(/($5, CAST(+(+($4, 
> $1), $7)):DOUBLE), 3), 100)], cr_item_qty=[$1], cr_dev=[*(/(/($2, 
> CAST(+(+($4, $1), $7)):DOUBLE), 3), 100)], wr_item_qty=[$7], 
> wr_dev=[*(/(/($8, CAST(+(+($4, $1), $7)):DOUBLE), 3), 100)], 
> average=[/(CAST(+(+($4, $1), $7)):DECIMAL(19, 0), 3:DECIMAL(1, 0))])
>     HiveJoin(condition=[=($0, $6)], joinType=[inner])
>       HiveJoin(condition=[=($3, $0)], joinType=[inner])
>         HiveProject($f0=[$0], $f1=[$1], EXPR$0=[CAST($1):DOUBLE])
>           HiveAggregate(group=[{4}], agg#0=[sum($1)])
>             HiveSemiJoin(condition=[=($6, $7)], joinType=[semi])
>               HiveJoin(condition=[=($2, $5)], joinType=[inner])
>                 HiveJoin(condition=[=($0, $3)], joinType=[inner])
>                   HiveProject(cr_item_sk=[$1], cr_return_quantity=[$16], 
> cr_returned_date_sk=[$26])
>                     HiveFilter(condition=[AND(IS NOT NULL($26), IS NOT 
> NULL($1))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> catalog_returns]], table:alias=[catalog_returns])
>                   HiveProject(i_item_sk=[$0], i_item_id=[$1])
>                     HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT 
> NULL($0))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> item]], table:alias=[item])
>                 HiveProject(d_date_sk=[$0], d_date=[$2])
>                   HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT 
> NULL($0))])
>                     HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>               HiveProject(d_date=[$0])
>                 HiveSemiJoin(condition=[=($1, $2)], joinType=[semi])
>                   HiveProject(d_date=[$2], d_week_seq=[$4])
>                     HiveFilter(condition=[AND(IS NOT NULL($4), IS NOT 
> NULL($2))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>                   HiveProject(d_week_seq=[$4])
>                     HiveFilter(condition=[AND(IN($2, 1998-01-02:DATE, 
> 1998-10-15:DATE, 1998-11-10:DATE), IS NOT NULL($4))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>         HiveProject($f0=[$0], $f1=[$1], EXPR$0=[CAST($1):DOUBLE])
>           HiveAggregate(group=[{4}], agg#0=[sum($1)])
>             HiveSemiJoin(condition=[=($6, $7)], joinType=[semi])
>               HiveJoin(condition=[=($2, $5)], joinType=[inner])
>                 HiveJoin(condition=[=($0, $3)], joinType=[inner])
>                   HiveProject(sr_item_sk=[$1], sr_return_quantity=[$9], 
> sr_returned_date_sk=[$19])
>                     HiveFilter(condition=[AND(IS NOT NULL($19), IS NOT 
> NULL($1))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> store_returns]], table:alias=[store_returns])
>                   HiveProject(i_item_sk=[$0], i_item_id=[$1])
>                     HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT 
> NULL($0))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> item]], table:alias=[item])
>                 HiveProject(d_date_sk=[$0], d_date=[$2])
>                   HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT 
> NULL($0))])
>                     HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>               HiveProject(d_date=[$0])
>                 HiveSemiJoin(condition=[=($1, $2)], joinType=[semi])
>                   HiveProject(d_date=[$2], d_week_seq=[$4])
>                     HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT 
> NULL($4))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>                   HiveProject(d_week_seq=[$4])
>                     HiveFilter(condition=[AND(IN($2, 1998-01-02:DATE, 
> 1998-10-15:DATE, 1998-11-10:DATE), IS NOT NULL($4))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>       HiveProject($f0=[$0], $f1=[$1], EXPR$0=[CAST($1):DOUBLE])
>         HiveAggregate(group=[{4}], agg#0=[sum($1)])
>           HiveSemiJoin(condition=[=($6, $7)], joinType=[semi])
>             HiveJoin(condition=[=($2, $5)], joinType=[inner])
>               HiveJoin(condition=[=($0, $3)], joinType=[inner])
>                 HiveProject(wr_item_sk=[$1], wr_return_quantity=[$13], 
> wr_returned_date_sk=[$23])
>                   HiveFilter(condition=[AND(IS NOT NULL($23), IS NOT 
> NULL($1))])
>                     HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> web_returns]], table:alias=[web_returns])
>                 HiveProject(i_item_sk=[$0], i_item_id=[$1])
>                   HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT 
> NULL($0))])
>                     HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> item]], table:alias=[item])
>               HiveProject(d_date_sk=[$0], d_date=[$2])
>                 HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($0))])
>                   HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>             HiveProject(d_date=[$0])
>               HiveSemiJoin(condition=[=($1, $2)], joinType=[semi])
>                 HiveProject(d_date=[$2], d_week_seq=[$4])
>                   HiveFilter(condition=[AND(IS NOT NULL($4), IS NOT 
> NULL($2))])
>                     HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>                 HiveProject(d_week_seq=[$4])
>                   HiveFilter(condition=[AND(IN($2, 1998-01-02:DATE, 
> 1998-10-15:DATE, 1998-11-10:DATE), IS NOT NULL($4))])
>                     HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
> {noformat}
>  
> *Modified Query & Plan:*
> {noformat}
> explain cbo
> with sr_items as
> (select i_item_id item_id,
> sum(sr_return_quantity) sr_item_qty
> from store_returns,
> item,
> date_dim
> where sr_item_sk = i_item_sk
> and   d_date    in
> (select d_date
> from date_dim
> where d_date in ('1998-01-02','1998-10-15','1998-11-10'))
> and   sr_returned_date_sk   = d_date_sk
> group by i_item_id),
> cr_items as
> (select i_item_id item_id,
> sum(cr_return_quantity) cr_item_qty
> from catalog_returns,
> item,
> date_dim
> where cr_item_sk = i_item_sk
> and   d_date    in
> (select d_date
> from date_dim
> where  d_date in ('1998-01-02','1998-10-15','1998-11-10'))
> and   cr_returned_date_sk   = d_date_sk
> group by i_item_id),
> wr_items as
> (select i_item_id item_id,
> sum(wr_return_quantity) wr_item_qty
> from web_returns,
> item,
> date_dim
> where wr_item_sk = i_item_sk
> and   d_date    in
> (select d_date
> from date_dim
> where d_date in ('1998-01-02','1998-10-15','1998-11-10'))
> and   wr_returned_date_sk   = d_date_sk
> group by i_item_id)
> select  sr_items.item_id
> ,sr_item_qty
> ,sr_item_qty/(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 * 100 sr_dev
> ,cr_item_qty
> ,cr_item_qty/(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 * 100 cr_dev
> ,wr_item_qty
> ,wr_item_qty/(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 * 100 wr_dev
> ,(sr_item_qty+cr_item_qty+wr_item_qty)/3.0 average
> from sr_items
> ,cr_items
> ,wr_items
> where sr_items.item_id=cr_items.item_id
> and sr_items.item_id=wr_items.item_id
> order by sr_items.item_id
> ,sr_item_qty
> limit 100
> INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
> INFO  : Completed executing 
> command(queryId=hive_20220307062043_2847c12d-9c22-452e-aa84-3200a3b9018b); 
> Time taken: 0.827 seconds
> INFO  : OK
> Explain
> CBO PLAN:
> HiveSortLimit(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC], fetch=[100])
>   HiveProject(item_id=[$0], sr_item_qty=[$4], sr_dev=[*(/(/($5, CAST(+(+($4, 
> $1), $7)):DOUBLE), 3), 100)], cr_item_qty=[$1], cr_dev=[*(/(/($2, 
> CAST(+(+($4, $1), $7)):DOUBLE), 3), 100)], wr_item_qty=[$7], 
> wr_dev=[*(/(/($8, CAST(+(+($4, $1), $7)):DOUBLE), 3), 100)], 
> average=[/(CAST(+(+($4, $1), $7)):DECIMAL(19, 0), 3:DECIMAL(1, 0))])
>     HiveJoin(condition=[=($0, $6)], joinType=[inner])
>       HiveJoin(condition=[=($3, $0)], joinType=[inner])
>         HiveProject($f0=[$0], $f1=[$1], EXPR$0=[CAST($1):DOUBLE])
>           HiveAggregate(group=[{4}], agg#0=[sum($1)])
>             HiveSemiJoin(condition=[=($6, $7)], joinType=[semi])
>               HiveJoin(condition=[=($2, $5)], joinType=[inner])
>                 HiveJoin(condition=[=($0, $3)], joinType=[inner])
>                   HiveProject(cr_item_sk=[$1], cr_return_quantity=[$16], 
> cr_returned_date_sk=[$26])
>                     HiveFilter(condition=[AND(IS NOT NULL($26), IS NOT 
> NULL($1))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> catalog_returns]], table:alias=[catalog_returns])
>                   HiveProject(i_item_sk=[$0], i_item_id=[$1])
>                     HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT 
> NULL($0))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> item]], table:alias=[item])
>                 HiveProject(d_date_sk=[$0], d_date=[$2])
>                   HiveFilter(condition=[AND(IN($2, 1998-01-02:DATE, 
> 1998-10-15:DATE, 1998-11-10:DATE), IS NOT NULL($0))])
>                     HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>               HiveProject(d_date=[$2])
>                 HiveFilter(condition=[IN($2, 1998-01-02:DATE, 
> 1998-10-15:DATE, 1998-11-10:DATE)])
>                   HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>         HiveProject($f0=[$0], $f1=[$1], EXPR$0=[CAST($1):DOUBLE])
>           HiveAggregate(group=[{4}], agg#0=[sum($1)])
>             HiveSemiJoin(condition=[=($6, $7)], joinType=[semi])
>               HiveJoin(condition=[=($2, $5)], joinType=[inner])
>                 HiveJoin(condition=[=($0, $3)], joinType=[inner])
>                   HiveProject(sr_item_sk=[$1], sr_return_quantity=[$9], 
> sr_returned_date_sk=[$19])
>                     HiveFilter(condition=[AND(IS NOT NULL($19), IS NOT 
> NULL($1))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> store_returns]], table:alias=[store_returns])
>                   HiveProject(i_item_sk=[$0], i_item_id=[$1])
>                     HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT 
> NULL($0))])
>                       HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> item]], table:alias=[item])
>                 HiveProject(d_date_sk=[$0], d_date=[$2])
>                   HiveFilter(condition=[AND(IN($2, 1998-01-02:DATE, 
> 1998-10-15:DATE, 1998-11-10:DATE), IS NOT NULL($0))])
>                     HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>               HiveProject(d_date=[$2])
>                 HiveFilter(condition=[IN($2, 1998-01-02:DATE, 
> 1998-10-15:DATE, 1998-11-10:DATE)])
>                   HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>       HiveProject($f0=[$0], $f1=[$1], EXPR$0=[CAST($1):DOUBLE])
>         HiveAggregate(group=[{4}], agg#0=[sum($1)])
>           HiveSemiJoin(condition=[=($6, $7)], joinType=[semi])
>             HiveJoin(condition=[=($2, $5)], joinType=[inner])
>               HiveJoin(condition=[=($0, $3)], joinType=[inner])
>                 HiveProject(wr_item_sk=[$1], wr_return_quantity=[$13], 
> wr_returned_date_sk=[$23])
>                   HiveFilter(condition=[AND(IS NOT NULL($23), IS NOT 
> NULL($1))])
>                     HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> web_returns]], table:alias=[web_returns])
>                 HiveProject(i_item_sk=[$0], i_item_id=[$1])
>                   HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT 
> NULL($0))])
>                     HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> item]], table:alias=[item])
>               HiveProject(d_date_sk=[$0], d_date=[$2])
>                 HiveFilter(condition=[AND(IN($2, 1998-01-02:DATE, 
> 1998-10-15:DATE, 1998-11-10:DATE), IS NOT NULL($0))])
>                   HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>             HiveProject(d_date=[$2])
>               HiveFilter(condition=[IN($2, 1998-01-02:DATE, 1998-10-15:DATE, 
> 1998-11-10:DATE)])
>                 HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000, 
> date_dim]], table:alias=[date_dim])
>  {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to