[
https://issues.apache.org/jira/browse/IMPALA-10769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Rorke updated IMPALA-10769:
---------------------------------
Attachment: q78_partial_profile.txt
> Bad join order in TPC-DS Q78 caused by bad cardinality estimate in aggregation
> ------------------------------------------------------------------------------
>
> Key: IMPALA-10769
> URL: https://issues.apache.org/jira/browse/IMPALA-10769
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 4.0
> Reporter: David Rorke
> Priority: Major
> Attachments: q78_partial_profile.txt
>
>
> The plan for TPC-DS Q78 uses a bad join order, with the largest input on the
> RHS, caused by a severe underestimate of the cardinality of the aggregation
> node in that join input.
> A summary is below (the full profile, including the query and plan, is attached):
> {noformat}
> F15:ROOT 1 1 0.000ns 0.000ns 4.01 MB 4.00 MB
> 38:MERGING-EXCHANGE 1 1 0.000ns 0.000ns 100 100 2.81 MB 1.15 MB UNPARTITIONED
> F04:EXCHANGE SENDER 10 120 33.333us 3.999ms 1.31 KB 0
> 20:TOP-N 10 120 32.900ms 264.000ms 12.00K 100 52.00 KB 9.38 KB
> 19:HASH JOIN 10 120 1s694ms 2s044ms 312.51K 8.80M 55.12 KB 0 RIGHT OUTER JOIN, PARTITIONED
> |--F16:JOIN BUILD 10 120 3s735ms 5s052ms 1.20 GB 17.00 MB
> |  37:EXCHANGE 10 120 634.767ms 1s004ms 1.44B 8.80M 15.80 MB 25.00 MB HASH(d_year,ss_item_sk,ss_customer_sk)
> |  F09:EXCHANGE SENDER 10 120 4s736ms 5s179ms 279.25 KB 0
> |  18:HASH JOIN 10 120 979.068ms 1s140ms 1.44B 8.80M 43.12 KB 0 RIGHT OUTER JOIN, PARTITIONED
> |  |--F17:JOIN BUILD 10 120 5s370ms 6s560ms 1.19 GB 8.50 MB
> |  |  36:EXCHANGE 10 120 390.834ms 640.000ms 1.44B 8.80M 7.20 MB 17.50 MB HASH(d_year,ss_item_sk,ss_customer_sk)
> |  |  F14:EXCHANGE SENDER 10 120 3s330ms 4s148ms 279.25 KB 0
> |  |  35:AGGREGATE 10 120 6s362ms 9s508ms 1.44B 8.80M 960.04 MB 1.02 GB FINALIZE
> |  |  34:EXCHANGE 10 120 517.901ms 788.005ms 1.46B 1.72B 14.31 MB 17.50 MB HASH(d_year,ss_item_sk,ss_customer_sk)
> |  |  F12:EXCHANGE SENDER 10 120 8s196ms 9s168ms 3.53 MB 0
> |  |  05:AGGREGATE 10 120 17s890ms 22s904ms 1.46B 1.72B 992.05 MB 780.95 MB STREAMING
> |  |  04:HASH JOIN 10 120 512.300ms 672.001ms 1.48B 1.72B 37.12 KB 0 INNER JOIN, BROADCAST
> |  |  |--F18:JOIN BUILD 10 10 32.400ms 72.000ms 23.27 MB 23.25 MB
> |  |  |  33:EXCHANGE 10 10 0.000ns 0.000ns 365 373 16.00 KB 16.00 KB BROADCAST
> |  |  |  F13:EXCHANGE SENDER 1 1 0.000ns 0.000ns 98.44 KB 0
> |  |  |  02:SCAN S3 1 1 8.000ms 8.000ms 365 373 1.62 MB 16.00 MB tpcds_3000_decimal_parquet.date_dim
> |  |  03:HASH JOIN 10 120 3s542ms 4s200ms 1.48B 8.64B 42.12 KB 0 LEFT OUTER JOIN, PARTITIONED
> |  |  |--F19:JOIN BUILD 10 120 1s178ms 1s432ms 384.03 MB 382.93 MB
> |  |  |  32:EXCHANGE 10 120 138.600ms 200.000ms 863.99M 863.99M 4.35 MB 12.34 MB HASH(sr_item_sk,sr_ticket_number)
> |  |  |  F11:EXCHANGE SENDER 10 120 1s982ms 2s628ms 4.16 MB 0
> |  |  |  01:SCAN S3 10 120 385.534ms 628.000ms 863.99M 863.99M 48.97 MB 40.00 MB tpcds_3000_decimal_parquet.store_returns
> |  |  31:EXCHANGE 10 120 787.434ms 2s316ms 1.64B 8.64B 14.32 MB 15.62 MB HASH(ss_item_sk,ss_ticket_number)
> |  |  F10:EXCHANGE SENDER 10 120 7s394ms 11s732ms 3.61 MB 0
> |  |  00:SCAN S3 10 120 2s575ms 4s356ms 1.64B 8.64B 98.15 MB 88.00 MB tpcds_3000_decimal_parquet.store_sales
> |  30:AGGREGATE 10 120 8s492ms 15s428ms 386.62M 430.99M 324.04 MB 260.31 MB FINALIZE
> |  29:EXCHANGE 10 120 257.634ms 1s024ms 386.62M 430.99M 14.32 MB 17.50 MB HASH(d_year,ws_item_sk,ws_bill_customer_sk)
> |  F07:EXCHANGE SENDER 10 120 4s837ms 10s996ms 3.53 MB 0
> |  11:AGGREGATE 10 120 3s005ms 4s536ms 386.62M 430.99M 82.05 MB 195.24 MB STREAMING
> |  10:HASH JOIN 10 120 158.866ms 252.000ms 386.62M 430.99M 37.12 KB 0 INNER JOIN, BROADCAST
> |  |--F20:JOIN BUILD 10 10 93.600ms 168.000ms 23.27 MB 23.25 MB
> |  |  28:EXCHANGE 10 10 0.000ns 0.000ns 365 373 16.00 KB 16.00 KB BROADCAST
> |  |  F08:EXCHANGE SENDER 1 1 0.000ns 0.000ns 98.44 KB 0
> |  |  08:SCAN S3 1 1 10s172ms 10s172ms 365 373 1.62 MB 16.00 MB tpcds_3000_decimal_parquet.date_dim
> |  09:HASH JOIN 10 120 880.001ms 1s096ms 386.63M 2.16B 42.12 KB 0 LEFT OUTER JOIN, PARTITIONED
> |  |--F21:JOIN BUILD 10 120 632.367ms 1s383ms 96.03 MB 95.73 MB
> |  |  27:EXCHANGE 10 120 34.266ms 108.000ms 216.00M 216.00M 1.09 MB 12.34 MB HASH(wr_item_sk,wr_order_number)
> |  |  F06:EXCHANGE SENDER 10 120 1s331ms 2s823ms 4.16 MB 0
> |  |  07:SCAN S3 10 120 11s708ms 12s496ms 216.00M 216.00M 48.92 MB 24.00 MB tpcds_3000_decimal_parquet.web_returns
> |  26:EXCHANGE 10 120 348.834ms 2s176ms 429.58M 2.16B 14.38 MB 15.62 MB HASH(ws_item_sk,ws_order_number)
> |  F05:EXCHANGE SENDER 10 120 2s616ms 6s356ms 3.61 MB 0
> |  06:SCAN S3 10 120 12s429ms 16s108ms 429.58M 2.16B 47.53 MB 88.00 MB tpcds_3000_decimal_parquet.web_sales
> 25:AGGREGATE 10 120 10s950ms 16s904ms 767.74M 855.91M 648.04 MB 516.96 MB FINALIZE
> 24:EXCHANGE 10 120 445.234ms 1s184ms 769.46M 855.91M 14.32 MB 17.50 MB HASH(d_year,cs_item_sk,cs_bill_customer_sk)
> F02:EXCHANGE SENDER 10 120 5s297ms 8s900ms 3.53 MB 0
> 17:AGGREGATE 10 120 5s660ms 8s160ms 769.46M 855.91M 128.05 MB 387.72 MB STREAMING
> 16:HASH JOIN 10 120 319.367ms 460.000ms 769.49M 855.91M 37.12 KB 0 INNER JOIN, BROADCAST
> |--F22:JOIN BUILD 10 10 67.600ms 179.999ms 23.27 MB 23.25 MB
> |  23:EXCHANGE 10 10 0.000ns 0.000ns 365 373 16.00 KB 16.00 KB BROADCAST
> |  F03:EXCHANGE SENDER 1 1 0.000ns 0.000ns 98.44 KB 0
> |  14:SCAN S3 1 1 10s244ms 10s244ms 365 373 1.62 MB 16.00 MB tpcds_3000_decimal_parquet.date_dim
> 15:HASH JOIN 10 120 1s920ms 2s264ms 769.50M 4.32B 42.12 KB 0 LEFT OUTER JOIN, PARTITIONED
> |--F23:JOIN BUILD 10 120 1s448ms 1s832ms 192.03 MB 191.47 MB
> |  22:EXCHANGE 10 120 64.933ms 136.000ms 432.02M 432.02M 6.37 MB 12.34 MB HASH(cr_item_sk,cr_order_number)
> |  F01:EXCHANGE SENDER 10 120 1s830ms 2s568ms 4.16 MB 0
> |  13:SCAN S3 10 120 11s753ms 12s600ms 432.02M 432.02M 8.92 MB 32.00 MB tpcds_3000_decimal_parquet.catalog_returns
> 21:EXCHANGE 10 120 592.967ms 1s720ms 855.00M 4.32B 14.28 MB 15.62 MB HASH(cs_item_sk,cs_order_number)
> F00:EXCHANGE SENDER 10 120 4s149ms 5s608ms 3.61 MB 0
> 12:SCAN S3 10 120 13s398ms 18s564ms 855.00M 4.32B 57.61 MB 88.00 MB tpcds_3000_decimal_parquet.catalog_sales
> {noformat}
> The problematic aggregation is plan node 35, which has an estimated output of
> 8.8M rows but an actual output of 1.44B rows.
> AggregationNode.java computes the cardinality by multiplying the NDVs of the
> grouping expressions, then [capping this value at the number of input
> rows|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L285],
> and finally [applying the
> conjuncts|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L237].
> The conjunct in this case is the filter d_year=2001, which has selectivity
> 1/196 ≈ 0.0051. Because we first cap the cardinality at the input size and
> then apply a very selective predicate to the already-capped value, the result
> is a bad underestimate in this case.
> A better approach might be to apply the conjunct selectivity to the NDV
> product first and then cap that result at the number of input rows, reducing
> the risk of underestimates like the one we see here.
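The arithmetic consequence of the operation order can be sketched as follows. This is not Impala's actual code; the input-row count (1.72B) and selectivity (1/196) are taken from the profile above, while the NDV product (5.0e9) is an assumed value chosen only to be larger than the input cardinality:

```java
// Sketch of the two orderings of "cap at input rows" vs. "apply conjuncts".
public class AggCardinalityOrderSketch {
    public static void main(String[] args) {
        double ndvProduct  = 5.0e9;      // assumed product of grouping-expr NDVs
        double inputRows   = 1.72e9;     // estimated aggregation input (profile)
        double selectivity = 1.0 / 196;  // d_year = 2001

        // Current order: cap first, then apply the conjunct.
        // The very selective predicate shrinks the already-capped value.
        double capThenFilter = Math.min(ndvProduct, inputRows) * selectivity;

        // Proposed order: apply the conjunct to the NDV product, then cap.
        double filterThenCap = Math.min(ndvProduct * selectivity, inputRows);

        System.out.printf("cap-then-filter: %.3e%n", capThenFilter); // ~8.8M
        System.out.printf("filter-then-cap: %.3e%n", filterThenCap); // ~25.5M
    }
}
```

The first ordering reproduces the 8.8M estimate from the profile; the second retains more of the NDV information and can only be larger (or equal), which matches the intuition that capping should be the last, safety-net step.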
> [Earlier versions of the
> code|https://github.infra.cloudera.com/CDH/Impala/blob/cdh5-2.12.0_5.16.2/fe/src/main/java/org/apache/impala/planner/AggregationNode.java]
> did the operations in this order (apply conjuncts first, then cap based
> on input rows) but the order of these operations was reversed by [this
> commit|https://github.com/apache/impala/commit/94f7d12f87f61cef6b341a5a1141b9bbe701de0b].
> It's not clear whether that commit intended to change the order of the
> conjunct application and the cap or if that was just an unintended side
> effect of code refactoring (maybe [~tarmstrong] can clarify?).
> I ran the same query with a straight_join hint to force the correct join
> order and got a 40% reduction in execution time. I also ran the unmodified
> query on an older version of the code, prior to the commit that reversed the
> order of the conjunct application and the cap; the resulting plan has a much
> better cardinality estimate (see plan node 25) and the correct join order.
> (Note this is a different cluster, at a larger scale factor, without mt_dop,
> so don't directly compare its numbers to the more recent profile.)
> {noformat}
> 38:MERGING-EXCHANGE 1 771.965us 771.965us 100 100 456.00 KB 0 UNPARTITIONED
> 20:TOP-N 19 50.204ms 71.974ms 1.90K 100 52.00 KB 9.38 KB
> 19:HASH JOIN 19 1m4s 1m33s 1.26M 5.75B 11.66 GB 9.23 GB LEFT OUTER JOIN, PARTITIONED
> |--37:EXCHANGE 19 3s219ms 3s969ms 2.56B 2.85B 7.43 MB 0 HASH(d_year,cs_item_sk,cs_bill_customer_sk)
> |  36:AGGREGATE 19 1m22s 1m31s 2.56B 2.85B 11.66 GB 175.36 GB FINALIZE
> |  35:EXCHANGE 19 3s347ms 3s845ms 2.56B 2.85B 13.53 MB 0 HASH(d_year,cs_item_sk,cs_bill_customer_sk)
> |  17:AGGREGATE 19 46s016ms 53s791ms 2.56B 2.85B 992.06 MB 175.36 GB STREAMING
> |  16:HASH JOIN 19 3s512ms 3s836ms 2.56B 2.85B 1.99 MB 1.94 MB INNER JOIN, BROADCAST
> |  |--34:EXCHANGE 19 14.477us 24.595us 365 373 18.97 KB 0 BROADCAST
> |  |  14:SCAN HDFS 1 51.493ms 51.493ms 365 373 1.13 MB 32.00 MB tpcds_10000_decimal_parquet.date_dim
> |  15:HASH JOIN 19 53s103ms 1m8s 2.57B 14.40B 3.22 GB 1.24 GB LEFT OUTER JOIN, PARTITIONED
> |  |--33:EXCHANGE 19 1s193ms 1s712ms 1.44B 1.44B 10.49 MB 0 HASH(cr_item_sk,cr_order_number)
> |  |  13:SCAN HDFS 19 325.397ms 359.082ms 1.44B 1.44B 127.32 MB 144.00 MB tpcds_10000_decimal_parquet.catalog_returns
> |  32:EXCHANGE 19 3s019ms 3s927ms 2.85B 14.40B 13.22 MB 0 HASH(cs_item_sk,cs_order_number)
> |  12:SCAN HDFS 19 2s021ms 3s140ms 2.85B 14.40B 355.82 MB 528.00 MB tpcds_10000_decimal_parquet.catalog_sales
> 18:HASH JOIN 19 40s974ms 55s991ms 4.30B 5.75B 5.88 GB 4.65 GB LEFT OUTER JOIN, PARTITIONED
> |--31:EXCHANGE 19 1s731ms 1s864ms 1.29B 1.44B 7.16 MB 0 HASH(d_year,ws_item_sk,ws_bill_customer_sk)
> |  30:AGGREGATE 19 41s146ms 49s094ms 1.29B 1.44B 5.88 GB 88.31 GB FINALIZE
> |  29:EXCHANGE 19 1s743ms 1s989ms 1.29B 1.44B 13.53 MB 0 HASH(d_year,ws_item_sk,ws_bill_customer_sk)
> |  11:AGGREGATE 19 18s910ms 20s163ms 1.29B 1.44B 128.06 MB 88.31 GB STREAMING
> |  10:HASH JOIN 19 1s742ms 1s801ms 1.29B 1.44B 1.99 MB 1.94 MB INNER JOIN, BROADCAST
> |  |--28:EXCHANGE 19 15.043us 24.887us 365 373 18.97 KB 0 BROADCAST
> |  |  08:SCAN HDFS 1 53.015ms 53.015ms 365 373 1.13 MB 32.00 MB tpcds_10000_decimal_parquet.date_dim
> |  09:HASH JOIN 19 22s658ms 26s425ms 1.29B 7.20B 1.63 GB 636.07 MB LEFT OUTER JOIN, PARTITIONED
> |  |--27:EXCHANGE 19 597.705ms 796.106ms 720.02M 720.02M 10.50 MB 0 HASH(wr_item_sk,wr_order_number)
> |  |  07:SCAN HDFS 19 185.858ms 315.385ms 720.02M 720.02M 98.54 MB 80.00 MB tpcds_10000_decimal_parquet.web_returns
> |  26:EXCHANGE 19 1s530ms 2s739ms 1.43B 7.20B 13.23 MB 0 HASH(ws_item_sk,ws_order_number)
> |  06:SCAN HDFS 19 1s589ms 2s720ms 1.43B 7.20B 369.88 MB 528.00 MB tpcds_10000_decimal_parquet.web_sales
> 25:AGGREGATE 19 1m35s 1m51s 4.30B 5.75B 21.04 GB 353.23 GB FINALIZE
> 24:EXCHANGE 19 4s893ms 5s539ms 4.31B 5.75B 13.53 MB 0 HASH(d_year,ss_item_sk,ss_customer_sk)
> 05:AGGREGATE 19 2m40s 3m8s 4.31B 5.75B 20.91 GB 353.23 GB STREAMING
> 04:HASH JOIN 19 6s349ms 7s056ms 4.92B 5.75B 1.99 MB 1.94 MB INNER JOIN, BROADCAST
> |--23:EXCHANGE 19 15.995us 38.006us 365 373 18.97 KB 0 BROADCAST
> |  02:SCAN HDFS 1 49.584ms 49.584ms 365 373 1.13 MB 32.00 MB tpcds_10000_decimal_parquet.date_dim
> 03:HASH JOIN 19 1m46s 2m20s 4.92B 28.80B 6.41 GB 2.48 GB LEFT OUTER JOIN, PARTITIONED
> |--22:EXCHANGE 19 2s227ms 3s111ms 2.88B 2.88B 10.49 MB 0 HASH(sr_item_sk,sr_ticket_number)
> |  01:SCAN HDFS 19 611.931ms 686.818ms 2.88B 2.88B 205.33 MB 176.00 MB tpcds_10000_decimal_parquet.store_returns
> 21:EXCHANGE 19 5s953ms 7s325ms 5.47B 28.80B 13.40 MB 0 HASH(ss_item_sk,ss_ticket_number)
> 00:SCAN HDFS 19 2s520ms 3s222ms 5.47B 28.80B 492.66 MB 528.00 MB tpcds_10000_decimal_parquet.store_sales
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)