[ 
https://issues.apache.org/jira/browse/IMPALA-10769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Rorke updated IMPALA-10769:
---------------------------------
    Labels: tpc-ds  (was: )

> Bad join order in TPC-DS Q78 caused by bad cardinality estimate in aggregation
> ------------------------------------------------------------------------------
>
>                 Key: IMPALA-10769
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10769
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 4.0
>            Reporter: David Rorke
>            Priority: Major
>              Labels: tpc-ds
>         Attachments: q78_partial_profile.txt
>
>
> The plan for TPC-DS Q78 uses a bad join order with the largest input on the 
> RHS, caused by a bad underestimate of the cardinality of the aggregation node 
> in that join input.
> Summary below (full profile including query and plan attached):
> {noformat}
> F15:ROOT                            1      1    0.000ns    0.000ns            
>              4.01 MB        4.00 MB
> 38:MERGING-EXCHANGE                 1      1    0.000ns    0.000ns      100   
>       100    2.81 MB        1.15 MB  UNPARTITIONED
> F04:EXCHANGE SENDER                10    120   33.333us    3.999ms            
>              1.31 KB              0
> 20:TOP-N                           10    120   32.900ms  264.000ms   12.00K   
>       100   52.00 KB        9.38 KB
> 19:HASH JOIN                       10    120    1s694ms    2s044ms  312.51K   
>     8.80M   55.12 KB              0  RIGHT OUTER JOIN, PARTITIONED
> |--F16:JOIN BUILD                  10    120    3s735ms    5s052ms            
>              1.20 GB       17.00 MB
> |  37:EXCHANGE                     10    120  634.767ms    1s004ms    1.44B   
>     8.80M   15.80 MB       25.00 MB  HASH(d_year,ss_item_sk,ss_customer_sk)
> |  F09:EXCHANGE SENDER             10    120    4s736ms    5s179ms            
>            279.25 KB              0
> |  18:HASH JOIN                    10    120  979.068ms    1s140ms    1.44B   
>     8.80M   43.12 KB              0  RIGHT OUTER JOIN, PARTITIONED
> |  |--F17:JOIN BUILD               10    120    5s370ms    6s560ms            
>              1.19 GB        8.50 MB
> |  |  36:EXCHANGE                  10    120  390.834ms  640.000ms    1.44B   
>     8.80M    7.20 MB       17.50 MB  HASH(d_year,ss_item_sk,ss_customer_sk)
> |  |  F14:EXCHANGE SENDER          10    120    3s330ms    4s148ms            
>            279.25 KB              0
> |  |  35:AGGREGATE                 10    120    6s362ms    9s508ms    1.44B   
>     8.80M  960.04 MB        1.02 GB  FINALIZE
> |  |  34:EXCHANGE                  10    120  517.901ms  788.005ms    1.46B   
>     1.72B   14.31 MB       17.50 MB  HASH(d_year,ss_item_sk,ss_customer_sk)
> |  |  F12:EXCHANGE SENDER          10    120    8s196ms    9s168ms            
>              3.53 MB              0
> |  |  05:AGGREGATE                 10    120   17s890ms   22s904ms    1.46B   
>     1.72B  992.05 MB      780.95 MB  STREAMING
> |  |  04:HASH JOIN                 10    120  512.300ms  672.001ms    1.48B   
>     1.72B   37.12 KB              0  INNER JOIN, BROADCAST
> |  |  |--F18:JOIN BUILD            10     10   32.400ms   72.000ms            
>             23.27 MB       23.25 MB
> |  |  |  33:EXCHANGE               10     10    0.000ns    0.000ns      365   
>       373   16.00 KB       16.00 KB  BROADCAST
> |  |  |  F13:EXCHANGE SENDER        1      1    0.000ns    0.000ns            
>             98.44 KB              0
> |  |  |  02:SCAN S3                 1      1    8.000ms    8.000ms      365   
>       373    1.62 MB       16.00 MB  tpcds_3000_decimal_parquet.date_dim
> |  |  03:HASH JOIN                 10    120    3s542ms    4s200ms    1.48B   
>     8.64B   42.12 KB              0  LEFT OUTER JOIN, PARTITIONED
> |  |  |--F19:JOIN BUILD            10    120    1s178ms    1s432ms            
>            384.03 MB      382.93 MB
> |  |  |  32:EXCHANGE               10    120  138.600ms  200.000ms  863.99M   
>   863.99M    4.35 MB       12.34 MB  HASH(sr_item_sk,sr_ticket_number)
> |  |  |  F11:EXCHANGE SENDER       10    120    1s982ms    2s628ms            
>              4.16 MB              0
> |  |  |  01:SCAN S3                10    120  385.534ms  628.000ms  863.99M   
>   863.99M   48.97 MB       40.00 MB  tpcds_3000_decimal_parquet.store_returns
> |  |  31:EXCHANGE                  10    120  787.434ms    2s316ms    1.64B   
>     8.64B   14.32 MB       15.62 MB  HASH(ss_item_sk,ss_ticket_number)
> |  |  F10:EXCHANGE SENDER          10    120    7s394ms   11s732ms            
>              3.61 MB              0
> |  |  00:SCAN S3                   10    120    2s575ms    4s356ms    1.64B   
>     8.64B   98.15 MB       88.00 MB  tpcds_3000_decimal_parquet.store_sales
> |  30:AGGREGATE                    10    120    8s492ms   15s428ms  386.62M   
>   430.99M  324.04 MB      260.31 MB  FINALIZE
> |  29:EXCHANGE                     10    120  257.634ms    1s024ms  386.62M   
>   430.99M   14.32 MB       17.50 MB  
> HASH(d_year,ws_item_sk,ws_bill_customer_sk)
> |  F07:EXCHANGE SENDER             10    120    4s837ms   10s996ms            
>              3.53 MB              0
> |  11:AGGREGATE                    10    120    3s005ms    4s536ms  386.62M   
>   430.99M   82.05 MB      195.24 MB  STREAMING
> |  10:HASH JOIN                    10    120  158.866ms  252.000ms  386.62M   
>   430.99M   37.12 KB              0  INNER JOIN, BROADCAST
> |  |--F20:JOIN BUILD               10     10   93.600ms  168.000ms            
>             23.27 MB       23.25 MB
> |  |  28:EXCHANGE                  10     10    0.000ns    0.000ns      365   
>       373   16.00 KB       16.00 KB  BROADCAST
> |  |  F08:EXCHANGE SENDER           1      1    0.000ns    0.000ns            
>             98.44 KB              0
> |  |  08:SCAN S3                    1      1   10s172ms   10s172ms      365   
>       373    1.62 MB       16.00 MB  tpcds_3000_decimal_parquet.date_dim
> |  09:HASH JOIN                    10    120  880.001ms    1s096ms  386.63M   
>     2.16B   42.12 KB              0  LEFT OUTER JOIN, PARTITIONED
> |  |--F21:JOIN BUILD               10    120  632.367ms    1s383ms            
>             96.03 MB       95.73 MB
> |  |  27:EXCHANGE                  10    120   34.266ms  108.000ms  216.00M   
>   216.00M    1.09 MB       12.34 MB  HASH(wr_item_sk,wr_order_number)
> |  |  F06:EXCHANGE SENDER          10    120    1s331ms    2s823ms            
>              4.16 MB              0
> |  |  07:SCAN S3                   10    120   11s708ms   12s496ms  216.00M   
>   216.00M   48.92 MB       24.00 MB  tpcds_3000_decimal_parquet.web_returns
> |  26:EXCHANGE                     10    120  348.834ms    2s176ms  429.58M   
>     2.16B   14.38 MB       15.62 MB  HASH(ws_item_sk,ws_order_number)
> |  F05:EXCHANGE SENDER             10    120    2s616ms    6s356ms            
>              3.61 MB              0
> |  06:SCAN S3                      10    120   12s429ms   16s108ms  429.58M   
>     2.16B   47.53 MB       88.00 MB  tpcds_3000_decimal_parquet.web_sales
> 25:AGGREGATE                       10    120   10s950ms   16s904ms  767.74M   
>   855.91M  648.04 MB      516.96 MB  FINALIZE
> 24:EXCHANGE                        10    120  445.234ms    1s184ms  769.46M   
>   855.91M   14.32 MB       17.50 MB  
> HASH(d_year,cs_item_sk,cs_bill_customer_sk)
> F02:EXCHANGE SENDER                10    120    5s297ms    8s900ms            
>              3.53 MB              0
> 17:AGGREGATE                       10    120    5s660ms    8s160ms  769.46M   
>   855.91M  128.05 MB      387.72 MB  STREAMING
> 16:HASH JOIN                       10    120  319.367ms  460.000ms  769.49M   
>   855.91M   37.12 KB              0  INNER JOIN, BROADCAST
> |--F22:JOIN BUILD                  10     10   67.600ms  179.999ms            
>             23.27 MB       23.25 MB
> |  23:EXCHANGE                     10     10    0.000ns    0.000ns      365   
>       373   16.00 KB       16.00 KB  BROADCAST
> |  F03:EXCHANGE SENDER              1      1    0.000ns    0.000ns            
>             98.44 KB              0
> |  14:SCAN S3                       1      1   10s244ms   10s244ms      365   
>       373    1.62 MB       16.00 MB  tpcds_3000_decimal_parquet.date_dim
> 15:HASH JOIN                       10    120    1s920ms    2s264ms  769.50M   
>     4.32B   42.12 KB              0  LEFT OUTER JOIN, PARTITIONED
> |--F23:JOIN BUILD                  10    120    1s448ms    1s832ms            
>            192.03 MB      191.47 MB
> |  22:EXCHANGE                     10    120   64.933ms  136.000ms  432.02M   
>   432.02M    6.37 MB       12.34 MB  HASH(cr_item_sk,cr_order_number)
> |  F01:EXCHANGE SENDER             10    120    1s830ms    2s568ms            
>              4.16 MB              0
> |  13:SCAN S3                      10    120   11s753ms   12s600ms  432.02M   
>   432.02M    8.92 MB       32.00 MB  
> tpcds_3000_decimal_parquet.catalog_returns
> 21:EXCHANGE                        10    120  592.967ms    1s720ms  855.00M   
>     4.32B   14.28 MB       15.62 MB  HASH(cs_item_sk,cs_order_number)
> F00:EXCHANGE SENDER                10    120    4s149ms    5s608ms            
>              3.61 MB              0
> 12:SCAN S3                         10    120   13s398ms   18s564ms  855.00M   
>     4.32B   57.61 MB       88.00 MB  tpcds_3000_decimal_parquet.catalog_sales
> {noformat}
> The problematic aggregation is plan node 35 which has an estimated output of 
> 8.8M rows and an actual output of 1.44B.
> AggregateNode.java computes the cardinality by multiplying the NDV of the 
> grouping expressions, then [capping this value at the number of input 
> rows|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L285]
>  and then finally [applying the 
> conjuncts|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L237].
>  The conjunct in this case is the filter d_year=2001 which has selectivity 
> 1/196=0.0051. The fact that we first cap the cardinality at the input size 
> and then apply a very selective predicate to the already capped value results 
> in a bad underestimate in this case.
> It might be a better approach to apply the conjunct selectivity to the NDV 
> product first and then cap the result of that based on the number of input 
> rows, to reduce the risk of underestimation like we see in this case. 
> [Earlier versions of the 
> code|https://github.infra.cloudera.com/CDH/Impala/blob/cdh5-2.12.0_5.16.2/fe/src/main/java/org/apache/impala/planner/AggregationNode.java]
>  did did the operations in this order (apply conjuncts first, then cap based 
> on input rows) but the order of these operations was reversed by [this 
> commit|https://github.com/apache/impala/commit/94f7d12f87f61cef6b341a5a1141b9bbe701de0b].
>  It's not clear whether that commit intended to change the order of the 
> conjunct application and the cap or if that was just an unintended side 
> effect of code refactoring (maybe [~tarmstrong] can clarify?).
> I ran the same query with a straight_join hint to force the correct join 
> order and got a 40% reduction in execution time. I also ran the query 
> (unmodified) on an older version of the code prior to the commit that 
> reversed the order of the conjunct application/cap and the resulting plan has 
> a much better cardinality (see plan node 25) and correct join order (note 
> this is a different cluster, larger scale factor, no mt_dop so don't directly 
> compare the numbers to the more recent profile):
> {noformat}
> 38:MERGING-EXCHANGE        1  771.965us  771.965us      100         100  
> 456.00 KB              0  UNPARTITIONED
> 20:TOP-N                  19   50.204ms   71.974ms    1.90K         100   
> 52.00 KB        9.38 KB
> 19:HASH JOIN              19       1m4s      1m33s    1.26M       5.75B   
> 11.66 GB        9.23 GB  LEFT OUTER JOIN, PARTITIONED
> |--37:EXCHANGE            19    3s219ms    3s969ms    2.56B       2.85B    
> 7.43 MB              0  HASH(d_year,cs_item_sk,cs_bill_customer_sk)
> |  36:AGGREGATE           19      1m22s      1m31s    2.56B       2.85B   
> 11.66 GB      175.36 GB  FINALIZE
> |  35:EXCHANGE            19    3s347ms    3s845ms    2.56B       2.85B   
> 13.53 MB              0  HASH(d_year,cs_item_sk,cs_bill_customer_sk)
> |  17:AGGREGATE           19   46s016ms   53s791ms    2.56B       2.85B  
> 992.06 MB      175.36 GB  STREAMING
> |  16:HASH JOIN           19    3s512ms    3s836ms    2.56B       2.85B    
> 1.99 MB        1.94 MB  INNER JOIN, BROADCAST
> |  |--34:EXCHANGE         19   14.477us   24.595us      365         373   
> 18.97 KB              0  BROADCAST
> |  |  14:SCAN HDFS         1   51.493ms   51.493ms      365         373    
> 1.13 MB       32.00 MB  tpcds_10000_decimal_parquet.date_dim
> |  15:HASH JOIN           19   53s103ms       1m8s    2.57B      14.40B    
> 3.22 GB        1.24 GB  LEFT OUTER JOIN, PARTITIONED
> |  |--33:EXCHANGE         19    1s193ms    1s712ms    1.44B       1.44B   
> 10.49 MB              0  HASH(cr_item_sk,cr_order_number)
> |  |  13:SCAN HDFS        19  325.397ms  359.082ms    1.44B       1.44B  
> 127.32 MB      144.00 MB  tpcds_10000_decimal_parquet.catalog_returns
> |  32:EXCHANGE            19    3s019ms    3s927ms    2.85B      14.40B   
> 13.22 MB              0  HASH(cs_item_sk,cs_order_number)
> |  12:SCAN HDFS           19    2s021ms    3s140ms    2.85B      14.40B  
> 355.82 MB      528.00 MB  tpcds_10000_decimal_parquet.catalog_sales
> 18:HASH JOIN              19   40s974ms   55s991ms    4.30B       5.75B    
> 5.88 GB        4.65 GB  LEFT OUTER JOIN, PARTITIONED
> |--31:EXCHANGE            19    1s731ms    1s864ms    1.29B       1.44B    
> 7.16 MB              0  HASH(d_year,ws_item_sk,ws_bill_customer_sk)
> |  30:AGGREGATE           19   41s146ms   49s094ms    1.29B       1.44B    
> 5.88 GB       88.31 GB  FINALIZE
> |  29:EXCHANGE            19    1s743ms    1s989ms    1.29B       1.44B   
> 13.53 MB              0  HASH(d_year,ws_item_sk,ws_bill_customer_sk)
> |  11:AGGREGATE           19   18s910ms   20s163ms    1.29B       1.44B  
> 128.06 MB       88.31 GB  STREAMING
> |  10:HASH JOIN           19    1s742ms    1s801ms    1.29B       1.44B    
> 1.99 MB        1.94 MB  INNER JOIN, BROADCAST
> |  |--28:EXCHANGE         19   15.043us   24.887us      365         373   
> 18.97 KB              0  BROADCAST
> |  |  08:SCAN HDFS         1   53.015ms   53.015ms      365         373    
> 1.13 MB       32.00 MB  tpcds_10000_decimal_parquet.date_dim
> |  09:HASH JOIN           19   22s658ms   26s425ms    1.29B       7.20B    
> 1.63 GB      636.07 MB  LEFT OUTER JOIN, PARTITIONED
> |  |--27:EXCHANGE         19  597.705ms  796.106ms  720.02M     720.02M   
> 10.50 MB              0  HASH(wr_item_sk,wr_order_number)
> |  |  07:SCAN HDFS        19  185.858ms  315.385ms  720.02M     720.02M   
> 98.54 MB       80.00 MB  tpcds_10000_decimal_parquet.web_returns
> |  26:EXCHANGE            19    1s530ms    2s739ms    1.43B       7.20B   
> 13.23 MB              0  HASH(ws_item_sk,ws_order_number)
> |  06:SCAN HDFS           19    1s589ms    2s720ms    1.43B       7.20B  
> 369.88 MB      528.00 MB  tpcds_10000_decimal_parquet.web_sales
> 25:AGGREGATE              19      1m35s      1m51s    4.30B       5.75B   
> 21.04 GB      353.23 GB  FINALIZE
> 24:EXCHANGE               19    4s893ms    5s539ms    4.31B       5.75B   
> 13.53 MB              0  HASH(d_year,ss_item_sk,ss_customer_sk)
> 05:AGGREGATE              19      2m40s       3m8s    4.31B       5.75B   
> 20.91 GB      353.23 GB  STREAMING
> 04:HASH JOIN              19    6s349ms    7s056ms    4.92B       5.75B    
> 1.99 MB        1.94 MB  INNER JOIN, BROADCAST
> |--23:EXCHANGE            19   15.995us   38.006us      365         373   
> 18.97 KB              0  BROADCAST
> |  02:SCAN HDFS            1   49.584ms   49.584ms      365         373    
> 1.13 MB       32.00 MB  tpcds_10000_decimal_parquet.date_dim
> 03:HASH JOIN              19      1m46s      2m20s    4.92B      28.80B    
> 6.41 GB        2.48 GB  LEFT OUTER JOIN, PARTITIONED
> |--22:EXCHANGE            19    2s227ms    3s111ms    2.88B       2.88B   
> 10.49 MB              0  HASH(sr_item_sk,sr_ticket_number)
> |  01:SCAN HDFS           19  611.931ms  686.818ms    2.88B       2.88B  
> 205.33 MB      176.00 MB  tpcds_10000_decimal_parquet.store_returns
> 21:EXCHANGE               19    5s953ms    7s325ms    5.47B      28.80B   
> 13.40 MB              0  HASH(ss_item_sk,ss_ticket_number)
> 00:SCAN HDFS              19    2s520ms    3s222ms    5.47B      28.80B  
> 492.66 MB      528.00 MB  tpcds_10000_decimal_parquet.store_sales
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to