[ https://issues.apache.org/jira/browse/IMPALA-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913034#comment-17913034 ]
David Rorke commented on IMPALA-2945:
-------------------------------------
A couple of points on the Method 1 vs Method 2 comparison discussed above and
in the review:
The big discrepancy in preagg output cardinality in the Q4 analysis above is
largely due to a basic bug in the current implementation of
PlanFragment.getPerInstanceNdvForCpuCosting(): we currently cap the return
value at inputCardinality when it should be capped at
perInstanceInputCardinality. If you adjust the result in the Q4 analysis above
assuming this bug is fixed, Method 1 would compute an overall preagg output
cardinality of 1,621,900,725, which is still slightly higher and less accurate
than Method 2's estimate but in the same range.
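For clarity, the change I have in mind is roughly the following (a simplified
sketch with illustrative names, not the real signature or body of
PlanFragment.getPerInstanceNdvForCpuCosting()):
{code}
// Simplified sketch of the cap fix described above (illustrative names,
// not the actual Impala planner code).
public class PerInstanceNdvCap {
  static long perInstanceNdvForCpuCosting(
      long globalNdv, long inputCardinality, int numInstances) {
    // Each instance sees roughly 1/numInstances of the input rows.
    long perInstanceInputCardinality =
        (long) Math.ceil((double) inputCardinality / numInstances);
    // The current code caps at inputCardinality; an instance cannot produce
    // more groups than the rows it actually processes, so the cap should be
    // the per-instance input cardinality instead.
    return Math.min(globalNdv, perInstanceInputCardinality);
  }
}
{code}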
One limitation of using TPC-DS Q4 as a comparison example is that the agg
contains two columns with very high NDVs (c_customer_id and c_email_address),
so after the multiplication the values produced by either method are massive
and both end up subject to similar caps based on input cardinality. That makes
it hard to judge the impact of reordering the globalNdv calculation and the
probability-based adjustment for duplicates.
I tried to create a different example with lower individual column NDVs and did
some spreadsheet calculations to compare the two methods across a range of
input cardinalities. I assumed 4 columns in the agg with the following
per-column NDVs:
{noformat}
col1: 100
col2: 20
col3: 10
col4: 3
Product of column NDVs: 60000
{noformat}
The preagg output cardinalities calculated by Method 1 and Method 2 for these
column NDVs across a range of input cardinalities are shown below; a rough
sketch of the spreadsheet calculation follows the table.
||Overall Agg Input Cardinality||Method 1||Method 2||
|1,659,515,219|7,200,000|7,200,000|
|165,951,521|7,200,000|7,200,000|
|16,595,152|7,200,000|6,481,666|
|1,659,515|1,659,515|1,482,161|
|165,951|165,951|164,054|
|16,595|16,595|16,595|
|1,659|1,659|1,659|
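Roughly, the spreadsheet calculation looks like the sketch below. This is an
approximation of the spreadsheet, not planner code: it assumes 120 fragment
instances (consistent with the 7,200,000 ceiling above) and the standard
duplicate adjustment d * (1 - (1 - 1/d)^n), with Method 1 adjusting each column
NDV against the overall input cardinality before multiplying and capping, and
Method 2 applying the adjustment to the combined NDV at the per-instance level
before scaling by the instance count. It reproduces the table above closely
(within rounding for most rows).
{code}
// Rough sketch of the spreadsheet comparison (not Impala planner code).
public class PreaggNdvComparison {
  static final long[] COLUMN_NDVS = {100, 20, 10, 3};
  static final int NUM_INSTANCES = 120;

  // Expected number of distinct values when drawing n rows from a domain of d.
  static double adjust(double d, double n) {
    return d * (1 - Math.pow(1 - 1.0 / d, n));
  }

  // Method 1 (as I read it): adjust each column NDV against the overall input
  // cardinality, multiply, scale by instances, cap at the input cardinality.
  static double method1(double inputCard) {
    double ndv = 1;
    for (long d : COLUMN_NDVS) ndv *= adjust(d, inputCard);
    return Math.min(ndv * NUM_INSTANCES, inputCard);
  }

  // Method 2 (as I read it): multiply the raw column NDVs, apply the duplicate
  // adjustment per instance, cap at the per-instance input, scale by instances.
  static double method2(double inputCard) {
    double product = 1;
    for (long d : COLUMN_NDVS) product *= d;
    double perInstanceInput = inputCard / NUM_INSTANCES;
    double perInstance =
        Math.min(adjust(product, perInstanceInput), perInstanceInput);
    return perInstance * NUM_INSTANCES;
  }

  public static void main(String[] args) {
    long[] inputCards = {1_659_515_219L, 165_951_521L, 16_595_152L,
        1_659_515L, 165_951L, 16_595L, 1_659L};
    for (long n : inputCards) {
      System.out.printf("%,15d  Method 1 = %,13.0f  Method 2 = %,13.0f%n",
          n, method1(n), method2(n));
    }
  }
}
{code}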
The main insights from this example: at higher input cardinalities both methods
end up with output cardinalities equal to the product of the column NDVs times
the number of instances, and at the lowest input cardinalities the caps kick in
for both methods, so the results are again equal. In the middle of the range
Method 2 produces a slightly lower estimate, but even in those cases the
differences are modest.
My overall conclusion is that the ordering of the operations in Method 1 and
Method 2 doesn't have a large impact on the estimated output cardinalities, and
both methods are effective at applying the probability-based adjustment.
> Pre-aggregation cardinality estimates do not take into account data distribution
> --------------------------------------------------------------------------------
>
> Key: IMPALA-2945
> URL: https://issues.apache.org/jira/browse/IMPALA-2945
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 2.0, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0,
> Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala
> 2.11.0, Impala 2.12.0
> Reporter: Mostafa Mokhtar
> Assignee: Riza Suminto
> Priority: Major
> Labels: planner, resource-management
> Attachments: ESTIMATE_DUPLICATE_IN_PREAGG_FALSE.txt,
> ESTIMATE_DUPLICATE_IN_PREAGG_TRUE.txt, tpcds_q4_baseline.txt
>
>
> When computing the per-host memory estimate for local aggregations, the
> planner does not take into account that data is randomly distributed across
> nodes, leading to significant underestimation in some cases. The suggested fix
> is to use min(agg input cardinality, NDV * #hosts) as the per-node
> cardinality used for the per-node memory estimate.
> Impact: In the query below, the planner significantly underestimates the
> per-node memory of agg node 03 as 3.28 GB, but the actual is 24.77 GB.
> Query
> {code}
> select sum(l_extendedprice) / 7.0 as avg_yearly
> from
>   lineitem,
>   part
> where
>   p_partkey = l_partkey
>   and p_brand = 'Brand#23'
>   and p_container = 'MED BOX'
>   and l_quantity < (
>     select
>       0.2 * avg(l_quantity)
>     from
>       lineitem
>     where
>       l_partkey = p_partkey
>   )
> {code}
> Plan
> {code}
> 12:AGGREGATE [FINALIZE]
> | output: sum:merge(l_extendedprice)
> | hosts=20 per-host-mem=unavailable
> | tuple-ids=6 row-size=16B cardinality=1
> |
> 11:EXCHANGE [UNPARTITIONED]
> | hosts=20 per-host-mem=unavailable
> | tuple-ids=6 row-size=16B cardinality=1
> |
> 06:AGGREGATE
> | output: sum(l_extendedprice)
> | hosts=20 per-host-mem=10.00MB
> | tuple-ids=6 row-size=16B cardinality=1
> |
> 05:HASH JOIN [RIGHT SEMI JOIN, PARTITIONED]
> | hash predicates: l_partkey = p_partkey
> | other join predicates: l_quantity < 0.2 * avg(l_quantity)
> | hosts=20 per-host-mem=125.18MB
> | tuple-ids=0,1 row-size=80B cardinality=29992141
> |
> |--10:EXCHANGE [HASH(p_partkey)]
> | | hosts=20 per-host-mem=0B
> | | tuple-ids=0,1 row-size=80B cardinality=29992141
> | |
> | 04:HASH JOIN [INNER JOIN, BROADCAST]
> | | hash predicates: l_partkey = p_partkey
> | | hosts=20 per-host-mem=58.30MB
> | | tuple-ids=0,1 row-size=80B cardinality=29992141
> | |
> | |--09:EXCHANGE [BROADCAST]
> | | | hosts=20 per-host-mem=0B
> | | | tuple-ids=1 row-size=56B cardinality=1000000
> | | |
> | | 01:SCAN HDFS [tpch_1000_decimal_parquet.part, RANDOM]
> | | partitions=1/1 files=40 size=6.38GB
> | | predicates: p_brand = 'Brand#23', p_container = 'MED BOX'
> | | table stats: 200000000 rows total
> | | column stats: all
> | | hosts=20 per-host-mem=264.00MB
> | | tuple-ids=1 row-size=56B cardinality=1000000
> | |
> | 00:SCAN HDFS [tpch_1000_decimal_parquet.lineitem, RANDOM]
> | partitions=1/1 files=880 size=216.61GB
> | table stats: 5999989709 rows total
> | column stats: all
> | hosts=20 per-host-mem=264.00MB
> | tuple-ids=0 row-size=24B cardinality=5999989709
> |
> 08:AGGREGATE [FINALIZE]
> | output: avg:merge(l_quantity)
> | group by: l_partkey
> | hosts=20 per-host-mem=167.89MB
> | tuple-ids=4 row-size=16B cardinality=200052064
> |
> 07:EXCHANGE [HASH(l_partkey)]
> | hosts=20 per-host-mem=0B
> | tuple-ids=3 row-size=16B cardinality=200052064
> |
> 03:AGGREGATE
> | output: avg(l_quantity)
> | group by: l_partkey
> | hosts=20 per-host-mem=3.28GB
> | tuple-ids=3 row-size=16B cardinality=200052064
> |
> 02:SCAN HDFS [tpch_1000_decimal_parquet.lineitem, RANDOM]
> partitions=1/1 files=880 size=216.61GB
> table stats: 5999989709 rows total
> column stats: all
> hosts=20 per-host-mem=176.00MB
> tuple-ids=2 row-size=16B cardinality=5999989709
> {code}
> Summary
> ||Operator||#Hosts||Avg Time||Max Time||#Rows||Est. #Rows||Peak Mem||Est. Peak Mem||Detail||
> |12:AGGREGATE|1|256.620ms|256.620ms|1|1|92.00 KB|-1.00 B|FINALIZE|
> |11:EXCHANGE|1|184.430us|184.430us|20|1|0|-1.00 B|UNPARTITIONED|
> |06:AGGREGATE|20|364.045ms|1s508ms|20|1|9.37 MB|10.00 MB| |
> |05:HASH JOIN|20|279.175ms|304.600ms|523.09K|29.99M|155.04 MB|125.18 MB|RIGHT SEMI JOIN, PARTITIONED|
> |I--10:EXCHANGE|20|22.448ms|32.954ms|5.98M|29.99M|0|0|HASH(p_partkey)|
> |I 04:HASH JOIN|20|25s417ms|35s579ms|5.98M|29.99M|146.02 MB|58.30 MB|INNER JOIN, BROADCAST|
> |I I--09:EXCHANGE|20|16.270ms|35.329ms|199.30K|1.00M|0|0|BROADCAST|
> |I I 01:SCAN HDFS|20|218.505ms|331.299ms|199.30K|1.00M|173.43 MB|264.00 MB|tpch_1000_decimal_parquet.part|
> |I 00:SCAN HDFS|20|1s365ms|1s822ms|6.00B|6.00B|1.92 GB|264.00 MB|tpch_1000_decimal_parquet.l...|
> |08:AGGREGATE|20|29s706ms|35s917ms|200.00M|200.05M|1.64 GB|167.89 MB|FINALIZE|
> |07:EXCHANGE|20|5s081ms|8s410ms|3.11B|200.05M|0|0|HASH(l_partkey)|
> |03:AGGREGATE|20|4m10s|5m12s|3.11B|200.05M|24.77 GB|3.28 GB| |
> |02:SCAN HDFS|20|1s544ms|2s517ms|6.00B|6.00B|838.85 MB|176.00 MB|tpch_1000_decimal_parquet.l...|
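As an aside on the original description quoted above, the suggested per-node
cardinality cap, min(agg input cardinality, NDV * #hosts), amounts to something
like the following minimal sketch (hypothetical names, not the planner's actual
API):
{code}
// Illustrative sketch of the cap suggested in the description above
// (hypothetical names, not the planner's actual API).
public class PerNodeAggCardinality {
  static long perNodeAggCardinality(
      long aggInputCardinality, long ndv, int numHosts) {
    // With randomly distributed input, each of the numHosts nodes can see up
    // to ndv distinct grouping keys, so the aggregate output across all nodes
    // is bounded by ndv * numHosts, and the whole thing can never exceed the
    // aggregation's input cardinality.
    return Math.min(aggInputCardinality, ndv * (long) numHosts);
  }
}
{code}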