[ 
https://issues.apache.org/jira/browse/IMPALA-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-3573:
----------------------------------
    Issue Type: Improvement  (was: Bug)

> Planner doesn't take into account runtime filter selectivity
> ------------------------------------------------------------
>
>                 Key: IMPALA-3573
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3573
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.5.0
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>              Labels: bushy, planner
>
> Applying selective runtime filters can drastically change the cardinality of 
> scan nodes, the planner doesn't cost the runtime filters as filters as a 
> result it misses out on a more selective plan. 
> In this particular query there is three fact to dimension joins 
> * (ss x d1) -> 389.28M rows
> * (sr x d2) -> 234.43M rows
> * (cs x d3) -> 12.85B rows
> The planner doesn't re-evaluate the cardinality estimation of ss, sr and cs 
> after the runtime filter are applied and puts ss as the left most node in the 
> plan where it should have been cs. 
> Ideally this should be a bushy plan.... 
> Query 
> {code}
> select i_item_id
>     ,i_item_desc
>     ,s_store_id
>     ,s_store_name
>     ,sum(ss_quantity)        as store_sales_quantity
>     ,sum(sr_return_quantity) as store_returns_quantity
>     ,sum(cs_quantity)        as catalog_sales_quantity
>  from
>     store_sales
>    ,store_returns
>    ,catalog_sales
>    ,date_dim             d1
>    ,date_dim             d2
>    ,date_dim             d3
>    ,store
>    ,item
>  where
>      d1.d_moy               = 4 
>  and d1.d_year              = 1999
>  and d1.d_date_sk           = ss_sold_date_sk
>  and i_item_sk              = ss_item_sk
>  and s_store_sk             = ss_store_sk
>  and ss_customer_sk         = sr_customer_sk
>  and ss_item_sk             = sr_item_sk
>  and ss_ticket_number       = sr_ticket_number
>  and sr_returned_date_sk    = d2.d_date_sk
>  and d2.d_moy               between 4 and  4 + 3 
>  and d2.d_year              = 1999
>  and sr_customer_sk         = cs_bill_customer_sk
>  and sr_item_sk             = cs_item_sk
>  and cs_sold_date_sk        = d3.d_date_sk     
>  and d3.d_year              in (1999,1999+1,1999+2)
>  group by
>     i_item_id
>    ,i_item_desc
>    ,s_store_id
>    ,s_store_name
>  order by
>     i_item_id 
>    ,i_item_desc
>    ,s_store_id
>    ,s_store_name
> limit 100
> {code}
> Plan 
> {code}
> 28:MERGING-EXCHANGE [UNPARTITIONED]
> |  order by: i_item_id ASC, i_item_desc ASC, s_store_id ASC, s_store_name ASC
> |  limit: 100
> |  hosts=20 per-host-mem=unavailable
> |  tuple-ids=9 row-size=224B cardinality=100
> |
> 16:TOP-N [LIMIT=100]
> |  order by: i_item_id ASC, i_item_desc ASC, s_store_id ASC, s_store_name ASC
> |  hosts=20 per-host-mem=21.89KB
> |  tuple-ids=9 row-size=224B cardinality=100
> |
> 27:AGGREGATE [FINALIZE]
> |  output: sum:merge(ss_quantity), sum:merge(sr_return_quantity), 
> sum:merge(cs_quantity)
> |  group by: i_item_id, i_item_desc, s_store_id, s_store_name
> |  hosts=20 per-host-mem=212.22MB
> |  tuple-ids=8 row-size=224B cardinality=335852270
> |
> 26:EXCHANGE [HASH(i_item_id,i_item_desc,s_store_id,s_store_name)]
> |  hosts=20 per-host-mem=0B
> |  tuple-ids=8 row-size=224B cardinality=335852270
> |
> 15:AGGREGATE [STREAMING]
> |  output: sum(ss_quantity), sum(sr_return_quantity), sum(cs_quantity)
> |  group by: i_item_id, i_item_desc, s_store_id, s_store_name
> |  hosts=20 per-host-mem=77.13GB
> |  tuple-ids=8 row-size=224B cardinality=335852270
> |
> 14:HASH JOIN [INNER JOIN, BROADCAST]
> |  hash predicates: ss_item_sk = i_item_sk
> |  runtime filters: RF000 <- i_item_sk
> |  hosts=20 per-host-mem=5.24MB
> |  tuple-ids=0,3,1,4,2,5,6,7 row-size=368B cardinality=335852270
> |
> |--25:EXCHANGE [BROADCAST]
> |  |  hosts=1 per-host-mem=0B
> |  |  tuple-ids=7 row-size=156B cardinality=32000
> |  |
> |  07:SCAN HDFS [tpcds_15000_decimal_parquet.item, RANDOM]
> |     partitions=1/1 files=1 size=3.14MB
> |     table stats: 32000 rows total
> |     column stats: all
> |     hosts=1 per-host-mem=48.00MB
> |     tuple-ids=7 row-size=156B cardinality=32000
> |
> 13:HASH JOIN [INNER JOIN, BROADCAST]
> |  hash predicates: ss_store_sk = s_store_sk
> |  runtime filters: RF001 <- s_store_sk
> |  hosts=20 per-host-mem=4.00KB
> |  tuple-ids=0,3,1,4,2,5,6 row-size=212B cardinality=335852270
> |
> |--24:EXCHANGE [BROADCAST]
> |  |  hosts=1 per-host-mem=0B
> |  |  tuple-ids=6 row-size=60B cardinality=62
> |  |
> |  06:SCAN HDFS [tpcds_15000_decimal_parquet.store, RANDOM]
> |     partitions=1/1 files=1 size=11.92KB
> |     table stats: 62 rows total
> |     column stats: all
> |     hosts=1 per-host-mem=48.00MB
> |     tuple-ids=6 row-size=60B cardinality=62
> |
> 12:HASH JOIN [INNER JOIN, BROADCAST]
> |  hash predicates: cs_sold_date_sk = d3.d_date_sk
> |  runtime filters: RF002 <- d3.d_date_sk
> |  hosts=20 per-host-mem=25.23KB
> |  tuple-ids=0,3,1,4,2,5 row-size=152B cardinality=335852270
> |
> |--23:EXCHANGE [BROADCAST]
> |  |  hosts=1 per-host-mem=0B
> |  |  tuple-ids=5 row-size=12B cardinality=1957
> |  |
> |  05:SCAN HDFS [tpcds_15000_decimal_parquet.date_dim d3, RANDOM]
> |     partitions=1/1 files=1 size=2.17MB
> |     predicates: d3.d_year IN (1999, 1999 + 1, 1999 + 2)
> |     table stats: 73049 rows total
> |     column stats: all
> |     hosts=1 per-host-mem=32.00MB
> |     tuple-ids=5 row-size=12B cardinality=1957
> |
> 11:HASH JOIN [INNER JOIN, PARTITIONED]
> |  hash predicates: sr_customer_sk = cs_bill_customer_sk, sr_item_sk = 
> cs_item_sk
> |  runtime filters: RF003 <- cs_bill_customer_sk, RF004 <- cs_item_sk
> |  hosts=20 per-host-mem=30.98GB
> |  tuple-ids=0,3,1,4,2 row-size=140B cardinality=352496606
> |
> |--22:EXCHANGE [HASH(cs_bill_customer_sk,cs_item_sk)]
> |  |  hosts=20 per-host-mem=0B
> |  |  tuple-ids=2 row-size=28B cardinality=21602442972
> |  |
> |  02:SCAN HDFS [tpcds_15000_decimal_parquet.catalog_sales, RANDOM]
> |     partitions=1837/1837 files=6122 size=1.33TB
> |     runtime filters: RF000 -> 
> tpcds_15000_decimal_parquet.catalog_sales.cs_item_sk, RF002 -> cs_sold_date_sk
> |     table stats: 21602442972 rows total
> |     column stats: all
> |     hosts=20 per-host-mem=264.00MB
> |     tuple-ids=2 row-size=28B cardinality=21602442972
> |
> 21:EXCHANGE [HASH(sr_customer_sk,sr_item_sk)]
> |  hosts=20 per-host-mem=0B
> |  tuple-ids=0,3,1,4 row-size=112B cardinality=352496606
> |
> 10:HASH JOIN [INNER JOIN, BROADCAST]
> |  hash predicates: sr_returned_date_sk = d2.d_date_sk
> |  runtime filters: RF005 <- d2.d_date_sk
> |  hosts=20 per-host-mem=3.54KB
> |  tuple-ids=0,3,1,4 row-size=112B cardinality=352496606
> |
> |--20:EXCHANGE [BROADCAST]
> |  |  hosts=1 per-host-mem=0B
> |  |  tuple-ids=4 row-size=16B cardinality=206
> |  |
> |  04:SCAN HDFS [tpcds_15000_decimal_parquet.date_dim d2, RANDOM]
> |     partitions=1/1 files=1 size=2.17MB
> |     predicates: d2.d_year = 1999, d2.d_moy >= 4, d2.d_moy <= 4 + 3
> |     table stats: 73049 rows total
> |     column stats: all
> |     hosts=1 per-host-mem=48.00MB
> |     tuple-ids=4 row-size=16B cardinality=206
> |
> 09:HASH JOIN [INNER JOIN, PARTITIONED]
> |  hash predicates: ss_customer_sk = sr_customer_sk, ss_item_sk = sr_item_sk, 
> ss_ticket_number = sr_ticket_number
> |  runtime filters: RF006 <- sr_customer_sk, RF007 <- sr_item_sk, RF008 <- 
> sr_ticket_number
> |  hosts=20 per-host-mem=7.96GB
> |  tuple-ids=0,3,1 row-size=96B cardinality=3834191129
> |
> |--19:EXCHANGE [HASH(sr_customer_sk,sr_item_sk,sr_ticket_number)]
> |  |  hosts=20 per-host-mem=0B
> |  |  tuple-ids=1 row-size=36B cardinality=4317083368
> |  |
> |  01:SCAN HDFS [tpcds_15000_decimal_parquet.store_returns, RANDOM]
> |     partitions=2004/2004 files=2022 size=201.98GB
> |     runtime filters: RF000 -> 
> tpcds_15000_decimal_parquet.store_returns.sr_item_sk, RF003 -> 
> sr_customer_sk, RF004 -> sr_item_sk, RF005 -> sr_returned_date_sk
> |     table stats: 4317083368 rows total
> |     column stats: all
> |     hosts=20 per-host-mem=352.00MB
> |     tuple-ids=1 row-size=36B cardinality=4317083368
> |
> 18:EXCHANGE [HASH(ss_customer_sk,ss_item_sk,ss_ticket_number)]
> |  hosts=20 per-host-mem=0B
> |  tuple-ids=0,3 row-size=60B cardinality=3834191129
> |
> 08:HASH JOIN [INNER JOIN, BROADCAST]
> |  hash predicates: ss_sold_date_sk = d1.d_date_sk
> |  runtime filters: RF009 <- d1.d_date_sk
> |  hosts=20 per-host-mem=3.11KB
> |  tuple-ids=0,3 row-size=60B cardinality=3834191129
> |
> |--17:EXCHANGE [BROADCAST]
> |  |  hosts=1 per-host-mem=0B
> |  |  tuple-ids=3 row-size=16B cardinality=181
> |  |
> |  03:SCAN HDFS [tpcds_15000_decimal_parquet.date_dim d1, RANDOM]
> |     partitions=1/1 files=1 size=2.17MB
> |     predicates: d1.d_year = 1999, d1.d_moy = 4
> |     table stats: 73049 rows total
> |     column stats: all
> |     hosts=1 per-host-mem=48.00MB
> |     tuple-ids=3 row-size=16B cardinality=181
> |
> 00:SCAN HDFS [tpcds_15000_decimal_parquet.store_sales, RANDOM]
>    partitions=1824/1824 files=6922 size=1.52TB
>    runtime filters: RF000 -> ss_item_sk, RF001 -> ss_store_sk, RF003 -> 
> tpcds_15000_decimal_parquet.store_sales.ss_customer_sk, RF004 -> 
> tpcds_15000_decimal_parquet.store_sales.ss_item_sk, RF006 -> ss_customer_sk, 
> RF007 -> ss_item_sk, RF008 -> ss_ticket_number, RF009 -> ss_sold_date_sk
>    table stats: 43202446268 rows total
>    column stats: all
>    hosts=20 per-host-mem=440.00MB
>    tuple-ids=0 row-size=44B cardinality=43202446268
> {code}
> Summary
> {code}
> Operator              #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   
> Peak Mem  Est. Peak Mem  Detail                         
> ----------------------------------------------------------------------------------------------------------------------------------
> 28:MERGING-EXCHANGE        1    1.109ms    1.109ms    1.02K         100       
>    0        -1.00 B  UNPARTITIONED                  
> 16:TOP-N                  20    3.652ms    5.942ms    2.00K         100  
> 132.00 KB       21.89 KB                                 
> 27:AGGREGATE              20  275.802ms  458.876ms  442.90K     335.85M   
> 11.35 MB      212.22 MB  FINALIZE                       
> 26:EXCHANGE               20   59.647ms  258.173ms    1.13M     335.85M       
>    0              0  HASH(i_item_id,i_item_desc,... 
> 15:AGGREGATE              20  343.414ms  422.845ms    1.13M     335.85M  
> 143.24 MB       77.13 GB  STREAMING                      
> 14:HASH JOIN              20   30.930ms   41.410ms    2.20M     335.85M   
> 12.99 MB        5.24 MB  INNER JOIN, BROADCAST          
> |--25:EXCHANGE            20    1.052ms    1.677ms   32.00K      32.00K       
>    0              0  BROADCAST                      
> |  07:SCAN HDFS            1   18.325ms   18.325ms   32.00K      32.00K    
> 9.43 MB       48.00 MB  tpcds_15000_decimal_parquet... 
> 13:HASH JOIN              20   18.778ms   25.275ms    2.20M     335.85M    
> 3.49 MB        4.00 KB  INNER JOIN, BROADCAST          
> |--24:EXCHANGE            20   25.213us  301.428us       62          62       
>    0              0  BROADCAST                      
> |  06:SCAN HDFS            1    5.887ms    5.887ms       62          62  
> 159.00 KB       48.00 MB  tpcds_15000_decimal_parquet... 
> 12:HASH JOIN              20   24.246ms   33.507ms    2.23M     335.85M    
> 2.81 MB       25.23 KB  INNER JOIN, BROADCAST          
> |--23:EXCHANGE            20   34.326us   49.903us    1.10K       1.96K       
>    0              0  BROADCAST                      
> |  05:SCAN HDFS            1   24.594ms   24.594ms    1.10K       1.96K    
> 2.29 MB       32.00 MB  tpcds_15000_decimal_parquet... 
> 11:HASH JOIN              20       4m8s      5m31s    2.23M     352.50M   
> 38.14 GB       30.98 GB  INNER JOIN, PARTITIONED        
> |--22:EXCHANGE            20   20s051ms   54s305ms   12.85B      21.60B       
>    0              0  HASH(cs_bill_customer_sk,cs... 
> |  02:SCAN HDFS           20    3s716ms    4s533ms   12.85B      21.60B  
> 257.30 MB      264.00 MB  tpcds_15000_decimal_parquet... 
> 21:EXCHANGE               20   64.831ms  101.634ms   16.49M     352.50M       
>    0              0  HASH(sr_customer_sk,sr_item... 
> 10:HASH JOIN              20  300.445ms  389.672ms   16.49M     352.50M    
> 2.94 MB        3.54 KB  INNER JOIN, BROADCAST          
> |--20:EXCHANGE            20   11.347us   17.878us      122         206       
>    0              0  BROADCAST                      
> |  04:SCAN HDFS            1   12.903ms   12.903ms      122         206    
> 3.50 MB       48.00 MB  tpcds_15000_decimal_parquet... 
> 09:HASH JOIN              20    8s854ms   11s946ms   16.49M       3.83B    
> 1.01 GB        7.96 GB  INNER JOIN, PARTITIONED        
> |--19:EXCHANGE            20  444.716ms    1s130ms  234.43M       4.32B       
>    0              0  HASH(sr_customer_sk,sr_item... 
> |  01:SCAN HDFS           20    1s235ms    1s397ms  234.43M       4.32B  
> 351.41 MB      352.00 MB  tpcds_15000_decimal_parquet... 
> 18:EXCHANGE               20    1s216ms    2s815ms  389.28M       3.83B       
>    0              0  HASH(ss_customer_sk,ss_item... 
> 08:HASH JOIN              20  949.230ms    2s216ms  389.28M       3.83B    
> 2.71 MB        3.11 KB  INNER JOIN, BROADCAST          
> |--17:EXCHANGE            20   11.615us   16.053us       30         181       
>    0              0  BROADCAST                      
> |  03:SCAN HDFS            1   14.187ms   14.187ms       30         181    
> 2.89 MB       48.00 MB  tpcds_15000_decimal_parquet... 
> 00:SCAN HDFS              20  620.986ms  759.453ms  389.28M      43.20B  
> 344.09 MB      440.00 MB  tpcds_15000_decimal_parquet... 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to