[
https://issues.apache.org/jira/browse/IMPALA-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Armstrong updated IMPALA-3573:
----------------------------------
Issue Type: Improvement (was: Bug)
> Planner doesn't take into account runtime filter selectivity
> ------------------------------------------------------------
>
> Key: IMPALA-3573
> URL: https://issues.apache.org/jira/browse/IMPALA-3573
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 2.5.0
> Reporter: Mostafa Mokhtar
> Priority: Major
> Labels: bushy, planner
>
> Applying selective runtime filters can drastically change the cardinality of
> scan nodes, the planner doesn't cost the runtime filters as filters as a
> result it misses out on a more selective plan.
> In this particular query there is three fact to dimension joins
> * (ss x d1) -> 389.28M rows
> * (sr x d2) -> 234.43M rows
> * (cs x d3) -> 12.85B rows
> The planner doesn't re-evaluate the cardinality estimation of ss, sr and cs
> after the runtime filter are applied and puts ss as the left most node in the
> plan where it should have been cs.
> Ideally this should be a bushy plan....
> Query
> {code}
> select i_item_id
> ,i_item_desc
> ,s_store_id
> ,s_store_name
> ,sum(ss_quantity) as store_sales_quantity
> ,sum(sr_return_quantity) as store_returns_quantity
> ,sum(cs_quantity) as catalog_sales_quantity
> from
> store_sales
> ,store_returns
> ,catalog_sales
> ,date_dim d1
> ,date_dim d2
> ,date_dim d3
> ,store
> ,item
> where
> d1.d_moy = 4
> and d1.d_year = 1999
> and d1.d_date_sk = ss_sold_date_sk
> and i_item_sk = ss_item_sk
> and s_store_sk = ss_store_sk
> and ss_customer_sk = sr_customer_sk
> and ss_item_sk = sr_item_sk
> and ss_ticket_number = sr_ticket_number
> and sr_returned_date_sk = d2.d_date_sk
> and d2.d_moy between 4 and 4 + 3
> and d2.d_year = 1999
> and sr_customer_sk = cs_bill_customer_sk
> and sr_item_sk = cs_item_sk
> and cs_sold_date_sk = d3.d_date_sk
> and d3.d_year in (1999,1999+1,1999+2)
> group by
> i_item_id
> ,i_item_desc
> ,s_store_id
> ,s_store_name
> order by
> i_item_id
> ,i_item_desc
> ,s_store_id
> ,s_store_name
> limit 100
> {code}
> Plan
> {code}
> 28:MERGING-EXCHANGE [UNPARTITIONED]
> | order by: i_item_id ASC, i_item_desc ASC, s_store_id ASC, s_store_name ASC
> | limit: 100
> | hosts=20 per-host-mem=unavailable
> | tuple-ids=9 row-size=224B cardinality=100
> |
> 16:TOP-N [LIMIT=100]
> | order by: i_item_id ASC, i_item_desc ASC, s_store_id ASC, s_store_name ASC
> | hosts=20 per-host-mem=21.89KB
> | tuple-ids=9 row-size=224B cardinality=100
> |
> 27:AGGREGATE [FINALIZE]
> | output: sum:merge(ss_quantity), sum:merge(sr_return_quantity),
> sum:merge(cs_quantity)
> | group by: i_item_id, i_item_desc, s_store_id, s_store_name
> | hosts=20 per-host-mem=212.22MB
> | tuple-ids=8 row-size=224B cardinality=335852270
> |
> 26:EXCHANGE [HASH(i_item_id,i_item_desc,s_store_id,s_store_name)]
> | hosts=20 per-host-mem=0B
> | tuple-ids=8 row-size=224B cardinality=335852270
> |
> 15:AGGREGATE [STREAMING]
> | output: sum(ss_quantity), sum(sr_return_quantity), sum(cs_quantity)
> | group by: i_item_id, i_item_desc, s_store_id, s_store_name
> | hosts=20 per-host-mem=77.13GB
> | tuple-ids=8 row-size=224B cardinality=335852270
> |
> 14:HASH JOIN [INNER JOIN, BROADCAST]
> | hash predicates: ss_item_sk = i_item_sk
> | runtime filters: RF000 <- i_item_sk
> | hosts=20 per-host-mem=5.24MB
> | tuple-ids=0,3,1,4,2,5,6,7 row-size=368B cardinality=335852270
> |
> |--25:EXCHANGE [BROADCAST]
> | | hosts=1 per-host-mem=0B
> | | tuple-ids=7 row-size=156B cardinality=32000
> | |
> | 07:SCAN HDFS [tpcds_15000_decimal_parquet.item, RANDOM]
> | partitions=1/1 files=1 size=3.14MB
> | table stats: 32000 rows total
> | column stats: all
> | hosts=1 per-host-mem=48.00MB
> | tuple-ids=7 row-size=156B cardinality=32000
> |
> 13:HASH JOIN [INNER JOIN, BROADCAST]
> | hash predicates: ss_store_sk = s_store_sk
> | runtime filters: RF001 <- s_store_sk
> | hosts=20 per-host-mem=4.00KB
> | tuple-ids=0,3,1,4,2,5,6 row-size=212B cardinality=335852270
> |
> |--24:EXCHANGE [BROADCAST]
> | | hosts=1 per-host-mem=0B
> | | tuple-ids=6 row-size=60B cardinality=62
> | |
> | 06:SCAN HDFS [tpcds_15000_decimal_parquet.store, RANDOM]
> | partitions=1/1 files=1 size=11.92KB
> | table stats: 62 rows total
> | column stats: all
> | hosts=1 per-host-mem=48.00MB
> | tuple-ids=6 row-size=60B cardinality=62
> |
> 12:HASH JOIN [INNER JOIN, BROADCAST]
> | hash predicates: cs_sold_date_sk = d3.d_date_sk
> | runtime filters: RF002 <- d3.d_date_sk
> | hosts=20 per-host-mem=25.23KB
> | tuple-ids=0,3,1,4,2,5 row-size=152B cardinality=335852270
> |
> |--23:EXCHANGE [BROADCAST]
> | | hosts=1 per-host-mem=0B
> | | tuple-ids=5 row-size=12B cardinality=1957
> | |
> | 05:SCAN HDFS [tpcds_15000_decimal_parquet.date_dim d3, RANDOM]
> | partitions=1/1 files=1 size=2.17MB
> | predicates: d3.d_year IN (1999, 1999 + 1, 1999 + 2)
> | table stats: 73049 rows total
> | column stats: all
> | hosts=1 per-host-mem=32.00MB
> | tuple-ids=5 row-size=12B cardinality=1957
> |
> 11:HASH JOIN [INNER JOIN, PARTITIONED]
> | hash predicates: sr_customer_sk = cs_bill_customer_sk, sr_item_sk =
> cs_item_sk
> | runtime filters: RF003 <- cs_bill_customer_sk, RF004 <- cs_item_sk
> | hosts=20 per-host-mem=30.98GB
> | tuple-ids=0,3,1,4,2 row-size=140B cardinality=352496606
> |
> |--22:EXCHANGE [HASH(cs_bill_customer_sk,cs_item_sk)]
> | | hosts=20 per-host-mem=0B
> | | tuple-ids=2 row-size=28B cardinality=21602442972
> | |
> | 02:SCAN HDFS [tpcds_15000_decimal_parquet.catalog_sales, RANDOM]
> | partitions=1837/1837 files=6122 size=1.33TB
> | runtime filters: RF000 ->
> tpcds_15000_decimal_parquet.catalog_sales.cs_item_sk, RF002 -> cs_sold_date_sk
> | table stats: 21602442972 rows total
> | column stats: all
> | hosts=20 per-host-mem=264.00MB
> | tuple-ids=2 row-size=28B cardinality=21602442972
> |
> 21:EXCHANGE [HASH(sr_customer_sk,sr_item_sk)]
> | hosts=20 per-host-mem=0B
> | tuple-ids=0,3,1,4 row-size=112B cardinality=352496606
> |
> 10:HASH JOIN [INNER JOIN, BROADCAST]
> | hash predicates: sr_returned_date_sk = d2.d_date_sk
> | runtime filters: RF005 <- d2.d_date_sk
> | hosts=20 per-host-mem=3.54KB
> | tuple-ids=0,3,1,4 row-size=112B cardinality=352496606
> |
> |--20:EXCHANGE [BROADCAST]
> | | hosts=1 per-host-mem=0B
> | | tuple-ids=4 row-size=16B cardinality=206
> | |
> | 04:SCAN HDFS [tpcds_15000_decimal_parquet.date_dim d2, RANDOM]
> | partitions=1/1 files=1 size=2.17MB
> | predicates: d2.d_year = 1999, d2.d_moy >= 4, d2.d_moy <= 4 + 3
> | table stats: 73049 rows total
> | column stats: all
> | hosts=1 per-host-mem=48.00MB
> | tuple-ids=4 row-size=16B cardinality=206
> |
> 09:HASH JOIN [INNER JOIN, PARTITIONED]
> | hash predicates: ss_customer_sk = sr_customer_sk, ss_item_sk = sr_item_sk,
> ss_ticket_number = sr_ticket_number
> | runtime filters: RF006 <- sr_customer_sk, RF007 <- sr_item_sk, RF008 <-
> sr_ticket_number
> | hosts=20 per-host-mem=7.96GB
> | tuple-ids=0,3,1 row-size=96B cardinality=3834191129
> |
> |--19:EXCHANGE [HASH(sr_customer_sk,sr_item_sk,sr_ticket_number)]
> | | hosts=20 per-host-mem=0B
> | | tuple-ids=1 row-size=36B cardinality=4317083368
> | |
> | 01:SCAN HDFS [tpcds_15000_decimal_parquet.store_returns, RANDOM]
> | partitions=2004/2004 files=2022 size=201.98GB
> | runtime filters: RF000 ->
> tpcds_15000_decimal_parquet.store_returns.sr_item_sk, RF003 ->
> sr_customer_sk, RF004 -> sr_item_sk, RF005 -> sr_returned_date_sk
> | table stats: 4317083368 rows total
> | column stats: all
> | hosts=20 per-host-mem=352.00MB
> | tuple-ids=1 row-size=36B cardinality=4317083368
> |
> 18:EXCHANGE [HASH(ss_customer_sk,ss_item_sk,ss_ticket_number)]
> | hosts=20 per-host-mem=0B
> | tuple-ids=0,3 row-size=60B cardinality=3834191129
> |
> 08:HASH JOIN [INNER JOIN, BROADCAST]
> | hash predicates: ss_sold_date_sk = d1.d_date_sk
> | runtime filters: RF009 <- d1.d_date_sk
> | hosts=20 per-host-mem=3.11KB
> | tuple-ids=0,3 row-size=60B cardinality=3834191129
> |
> |--17:EXCHANGE [BROADCAST]
> | | hosts=1 per-host-mem=0B
> | | tuple-ids=3 row-size=16B cardinality=181
> | |
> | 03:SCAN HDFS [tpcds_15000_decimal_parquet.date_dim d1, RANDOM]
> | partitions=1/1 files=1 size=2.17MB
> | predicates: d1.d_year = 1999, d1.d_moy = 4
> | table stats: 73049 rows total
> | column stats: all
> | hosts=1 per-host-mem=48.00MB
> | tuple-ids=3 row-size=16B cardinality=181
> |
> 00:SCAN HDFS [tpcds_15000_decimal_parquet.store_sales, RANDOM]
> partitions=1824/1824 files=6922 size=1.52TB
> runtime filters: RF000 -> ss_item_sk, RF001 -> ss_store_sk, RF003 ->
> tpcds_15000_decimal_parquet.store_sales.ss_customer_sk, RF004 ->
> tpcds_15000_decimal_parquet.store_sales.ss_item_sk, RF006 -> ss_customer_sk,
> RF007 -> ss_item_sk, RF008 -> ss_ticket_number, RF009 -> ss_sold_date_sk
> table stats: 43202446268 rows total
> column stats: all
> hosts=20 per-host-mem=440.00MB
> tuple-ids=0 row-size=44B cardinality=43202446268
> {code}
> Summary
> {code}
> Operator #Hosts Avg Time Max Time #Rows Est. #Rows
> Peak Mem Est. Peak Mem Detail
> ----------------------------------------------------------------------------------------------------------------------------------
> 28:MERGING-EXCHANGE 1 1.109ms 1.109ms 1.02K 100
> 0 -1.00 B UNPARTITIONED
> 16:TOP-N 20 3.652ms 5.942ms 2.00K 100
> 132.00 KB 21.89 KB
> 27:AGGREGATE 20 275.802ms 458.876ms 442.90K 335.85M
> 11.35 MB 212.22 MB FINALIZE
> 26:EXCHANGE 20 59.647ms 258.173ms 1.13M 335.85M
> 0 0 HASH(i_item_id,i_item_desc,...
> 15:AGGREGATE 20 343.414ms 422.845ms 1.13M 335.85M
> 143.24 MB 77.13 GB STREAMING
> 14:HASH JOIN 20 30.930ms 41.410ms 2.20M 335.85M
> 12.99 MB 5.24 MB INNER JOIN, BROADCAST
> |--25:EXCHANGE 20 1.052ms 1.677ms 32.00K 32.00K
> 0 0 BROADCAST
> | 07:SCAN HDFS 1 18.325ms 18.325ms 32.00K 32.00K
> 9.43 MB 48.00 MB tpcds_15000_decimal_parquet...
> 13:HASH JOIN 20 18.778ms 25.275ms 2.20M 335.85M
> 3.49 MB 4.00 KB INNER JOIN, BROADCAST
> |--24:EXCHANGE 20 25.213us 301.428us 62 62
> 0 0 BROADCAST
> | 06:SCAN HDFS 1 5.887ms 5.887ms 62 62
> 159.00 KB 48.00 MB tpcds_15000_decimal_parquet...
> 12:HASH JOIN 20 24.246ms 33.507ms 2.23M 335.85M
> 2.81 MB 25.23 KB INNER JOIN, BROADCAST
> |--23:EXCHANGE 20 34.326us 49.903us 1.10K 1.96K
> 0 0 BROADCAST
> | 05:SCAN HDFS 1 24.594ms 24.594ms 1.10K 1.96K
> 2.29 MB 32.00 MB tpcds_15000_decimal_parquet...
> 11:HASH JOIN 20 4m8s 5m31s 2.23M 352.50M
> 38.14 GB 30.98 GB INNER JOIN, PARTITIONED
> |--22:EXCHANGE 20 20s051ms 54s305ms 12.85B 21.60B
> 0 0 HASH(cs_bill_customer_sk,cs...
> | 02:SCAN HDFS 20 3s716ms 4s533ms 12.85B 21.60B
> 257.30 MB 264.00 MB tpcds_15000_decimal_parquet...
> 21:EXCHANGE 20 64.831ms 101.634ms 16.49M 352.50M
> 0 0 HASH(sr_customer_sk,sr_item...
> 10:HASH JOIN 20 300.445ms 389.672ms 16.49M 352.50M
> 2.94 MB 3.54 KB INNER JOIN, BROADCAST
> |--20:EXCHANGE 20 11.347us 17.878us 122 206
> 0 0 BROADCAST
> | 04:SCAN HDFS 1 12.903ms 12.903ms 122 206
> 3.50 MB 48.00 MB tpcds_15000_decimal_parquet...
> 09:HASH JOIN 20 8s854ms 11s946ms 16.49M 3.83B
> 1.01 GB 7.96 GB INNER JOIN, PARTITIONED
> |--19:EXCHANGE 20 444.716ms 1s130ms 234.43M 4.32B
> 0 0 HASH(sr_customer_sk,sr_item...
> | 01:SCAN HDFS 20 1s235ms 1s397ms 234.43M 4.32B
> 351.41 MB 352.00 MB tpcds_15000_decimal_parquet...
> 18:EXCHANGE 20 1s216ms 2s815ms 389.28M 3.83B
> 0 0 HASH(ss_customer_sk,ss_item...
> 08:HASH JOIN 20 949.230ms 2s216ms 389.28M 3.83B
> 2.71 MB 3.11 KB INNER JOIN, BROADCAST
> |--17:EXCHANGE 20 11.615us 16.053us 30 181
> 0 0 BROADCAST
> | 03:SCAN HDFS 1 14.187ms 14.187ms 30 181
> 2.89 MB 48.00 MB tpcds_15000_decimal_parquet...
> 00:SCAN HDFS 20 620.986ms 759.453ms 389.28M 43.20B
> 344.09 MB 440.00 MB tpcds_15000_decimal_parquet...
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]