[ https://issues.apache.org/jira/browse/IMPALA-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913034#comment-17913034 ]
David Rorke commented on IMPALA-2945:
-------------------------------------
A couple of points on the Method 1 vs Method 2 comparison discussed above and
in the review:
The big discrepancy in preagg output cardinality in the Q4 analysis above is
largely due to a basic bug in the current implementation of
PlanFragment.getPerInstanceNdvForCpuCosting(): we currently cap the return
value at inputCardinality when it should be capped at
perInstanceInputCardinality. If you adjust the result in the Q4 analysis above
assuming this bug is fixed, Method 1 would compute an overall preagg output
cardinality of 1,621,900,725, which is still slightly higher and less accurate
than Method 2's estimate but in the same range.
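For clarity, the change I have in mind is roughly the following (a simplified
sketch with illustrative names, not the real signature or body of
PlanFragment.getPerInstanceNdvForCpuCosting()):
{code}
// Simplified sketch of the cap fix described above (illustrative names,
// not the actual Impala planner code).
public class PerInstanceNdvCap {
  static long perInstanceNdvForCpuCosting(
      long globalNdv, long inputCardinality, int numInstances) {
    // Each instance sees roughly 1/numInstances of the input rows.
    long perInstanceInputCardinality =
        (long) Math.ceil((double) inputCardinality / numInstances);
    // The current code caps at inputCardinality; an instance cannot produce
    // more groups than the rows it actually processes, so the cap should be
    // the per-instance input cardinality instead.
    return Math.min(globalNdv, perInstanceInputCardinality);
  }
}
{code}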
One limitation of using TPC-DS Q4 as a comparison example is that the agg
contains two columns with very high NDVs (c_customer_id and c_email_address),
so after the multiplication the values produced by either method are massive
and both end up subject to similar caps based on input cardinality. That makes
it hard to judge the impact of reordering the globalNdv calculation and the
probability-based adjustment for duplicates.
I tried to create a different example with lower individual column NDVs and did
some spreadsheet calculations to compare the two methods across a range of
input cardinalities. I assumed 4 columns in the agg with the following
per-column NDVs:
{noformat}
col1: 100
col2: 20
col3: 10
col4: 3
Product of column NDVs: 60000
{noformat}
The preagg output cardinalities calculated by Method 1 and Method 2 for these
column NDVs across a range of input cardinalities are shown below; a rough
sketch of the spreadsheet calculation follows the table.
||Overall Agg Input Cardinality||Method 1||Method 2||
|1,659,515,219|7,200,000|7,200,000|
|165,951,521|7,200,000|7,200,000|
|16,595,152|7,200,000|6,481,666|
|1,659,515|1,659,515|1,482,161|
|165,951|165,951|164,054|
|16,595|16,595|16,595|
|1,659|1,659|1,659|
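Roughly, the spreadsheet calculation looks like the sketch below. This is an
approximation of the spreadsheet, not planner code: it assumes 120 fragment
instances (consistent with the 7,200,000 ceiling above) and the standard
duplicate adjustment d * (1 - (1 - 1/d)^n), with Method 1 adjusting each column
NDV against the overall input cardinality before multiplying and capping, and
Method 2 applying the adjustment to the combined NDV at the per-instance level
before scaling by the instance count. It reproduces the table above closely
(within rounding for most rows).
{code}
// Rough sketch of the spreadsheet comparison (not Impala planner code).
public class PreaggNdvComparison {
  static final long[] COLUMN_NDVS = {100, 20, 10, 3};
  static final int NUM_INSTANCES = 120;

  // Expected number of distinct values when drawing n rows from a domain of d.
  static double adjust(double d, double n) {
    return d * (1 - Math.pow(1 - 1.0 / d, n));
  }

  // Method 1 (as I read it): adjust each column NDV against the overall input
  // cardinality, multiply, scale by instances, cap at the input cardinality.
  static double method1(double inputCard) {
    double ndv = 1;
    for (long d : COLUMN_NDVS) ndv *= adjust(d, inputCard);
    return Math.min(ndv * NUM_INSTANCES, inputCard);
  }

  // Method 2 (as I read it): multiply the raw column NDVs, apply the duplicate
  // adjustment per instance, cap at the per-instance input, scale by instances.
  static double method2(double inputCard) {
    double product = 1;
    for (long d : COLUMN_NDVS) product *= d;
    double perInstanceInput = inputCard / NUM_INSTANCES;
    double perInstance =
        Math.min(adjust(product, perInstanceInput), perInstanceInput);
    return perInstance * NUM_INSTANCES;
  }

  public static void main(String[] args) {
    long[] inputCards = {1_659_515_219L, 165_951_521L, 16_595_152L,
        1_659_515L, 165_951L, 16_595L, 1_659L};
    for (long n : inputCards) {
      System.out.printf("%,15d  Method 1 = %,13.0f  Method 2 = %,13.0f%n",
          n, method1(n), method2(n));
    }
  }
}
{code}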
The main insights from this example: at higher input cardinalities both methods
end up with output cardinalities equal to the product of the column NDVs times
the number of instances, and at the lowest input cardinalities the caps kick in
for both methods, so the results are again equal. In the middle of the range
Method 2 produces a slightly lower estimate, but even in those cases the
differences are modest.
My overall conclusion is that the ordering of the operations in Method 1 and
Method 2 doesn't have a large impact on the estimated output cardinalities, and
both methods are effective at applying the probability-based adjustment.
> Pre-aggregation cardinality estimates do not take into account data distribution
> --------------------------------------------------------------------------------
>
> Key: IMPALA-2945
> URL: https://issues.apache.org/jira/browse/IMPALA-2945
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 2.0, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0,
> Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala
> 2.11.0, Impala 2.12.0
> Reporter: Mostafa Mokhtar
> Assignee: Riza Suminto
> Priority: Major
> Labels: planner, resource-management
> Attachments: ESTIMATE_DUPLICATE_IN_PREAGG_FALSE.txt,
> ESTIMATE_DUPLICATE_IN_PREAGG_TRUE.txt, tpcds_q4_baseline.txt
>
>
> When computing the per-host memory estimate for local aggregations, the
> planner does not take into account that data is randomly distributed across
> nodes, leading to significant underestimation in some cases. The suggested fix
> is to use min(agg input cardinality, NDV * #hosts) as the per-node
> cardinality used for the per-node memory estimate.
> Impact: In the query below, the planner significantly underestimates the
> per-node memory of agg node 03 as 3.28 GB, but the actual is 24.77 GB.
> Query
> {code}
> select sum(l_extendedprice) / 7.0 as avg_yearly
> from
>   lineitem,
>   part
> where
>   p_partkey = l_partkey
>   and p_brand = 'Brand#23'
>   and p_container = 'MED BOX'
>   and l_quantity < (
>     select
>       0.2 * avg(l_quantity)
>     from
>       lineitem
>     where
>       l_partkey = p_partkey
>   )
> {code}
> Plan
> {code}
> 12:AGGREGATE [FINALIZE]
> | output: sum:merge(l_extendedprice)
> | hosts=20 per-host-mem=unavailable
> | tuple-ids=6 row-size=16B cardinality=1
> |
> 11:EXCHANGE [UNPARTITIONED]
> | hosts=20 per-host-mem=unavailable
> | tuple-ids=6 row-size=16B cardinality=1
> |
> 06:AGGREGATE
> | output: sum(l_extendedprice)
> | hosts=20 per-host-mem=10.00MB
> | tuple-ids=6 row-size=16B cardinality=1
> |
> 05:HASH JOIN [RIGHT SEMI JOIN, PARTITIONED]
> | hash predicates: l_partkey = p_partkey
> | other join predicates: l_quantity < 0.2 * avg(l_quantity)
> | hosts=20 per-host-mem=125.18MB
> | tuple-ids=0,1 row-size=80B cardinality=29992141
> |
> |--10:EXCHANGE [HASH(p_partkey)]
> | | hosts=20 per-host-mem=0B
> | | tuple-ids=0,1 row-size=80B cardinality=29992141
> | |
> | 04:HASH JOIN [INNER JOIN, BROADCAST]
> | | hash predicates: l_partkey = p_partkey
> | | hosts=20 per-host-mem=58.30MB
> | | tuple-ids=0,1 row-size=80B cardinality=29992141
> | |
> | |--09:EXCHANGE [BROADCAST]
> | | | hosts=20 per-host-mem=0B
> | | | tuple-ids=1 row-size=56B cardinality=1000000
> | | |
> | | 01:SCAN HDFS [tpch_1000_decimal_parquet.part, RANDOM]
> | | partitions=1/1 files=40 size=6.38GB
> | | predicates: p_brand = 'Brand#23', p_container = 'MED BOX'
> | | table stats: 200000000 rows total
> | | column stats: all
> | | hosts=20 per-host-mem=264.00MB
> | | tuple-ids=1 row-size=56B cardinality=1000000
> | |
> | 00:SCAN HDFS [tpch_1000_decimal_parquet.lineitem, RANDOM]
> | partitions=1/1 files=880 size=216.61GB
> | table stats: 5999989709 rows total
> | column stats: all
> | hosts=20 per-host-mem=264.00MB
> | tuple-ids=0 row-size=24B cardinality=5999989709
> |
> 08:AGGREGATE [FINALIZE]
> | output: avg:merge(l_quantity)
> | group by: l_partkey
> | hosts=20 per-host-mem=167.89MB
> | tuple-ids=4 row-size=16B cardinality=200052064
> |
> 07:EXCHANGE [HASH(l_partkey)]
> | hosts=20 per-host-mem=0B
> | tuple-ids=3 row-size=16B cardinality=200052064
> |
> 03:AGGREGATE
> | output: avg(l_quantity)
> | group by: l_partkey
> | hosts=20 per-host-mem=3.28GB
> | tuple-ids=3 row-size=16B cardinality=200052064
> |
> 02:SCAN HDFS [tpch_1000_decimal_parquet.lineitem, RANDOM]
> partitions=1/1 files=880 size=216.61GB
> table stats: 5999989709 rows total
> column stats: all
> hosts=20 per-host-mem=176.00MB
> tuple-ids=2 row-size=16B cardinality=5999989709
> {code}
> Summary
> ||Operator||#Hosts||Avg Time||Max Time||#Rows||Est. #Rows||Peak Mem||Est. Peak Mem||Detail||
> |12:AGGREGATE|1|256.620ms|256.620ms|1|1|92.00 KB|-1.00 B|FINALIZE|
> |11:EXCHANGE|1|184.430us|184.430us|20|1|0|-1.00 B|UNPARTITIONED|
> |06:AGGREGATE|20|364.045ms|1s508ms|20|1|9.37 MB|10.00 MB| |
> |05:HASH JOIN|20|279.175ms|304.600ms|523.09K|29.99M|155.04 MB|125.18 MB|RIGHT SEMI JOIN, PARTITIONED|
> |I--10:EXCHANGE|20|22.448ms|32.954ms|5.98M|29.99M|0|0|HASH(p_partkey)|
> |I 04:HASH JOIN|20|25s417ms|35s579ms|5.98M|29.99M|146.02 MB|58.30 MB|INNER JOIN, BROADCAST|
> |I I--09:EXCHANGE|20|16.270ms|35.329ms|199.30K|1.00M|0|0|BROADCAST|
> |I I 01:SCAN HDFS|20|218.505ms|331.299ms|199.30K|1.00M|173.43 MB|264.00 MB|tpch_1000_decimal_parquet.part|
> |I 00:SCAN HDFS|20|1s365ms|1s822ms|6.00B|6.00B|1.92 GB|264.00 MB|tpch_1000_decimal_parquet.l...|
> |08:AGGREGATE|20|29s706ms|35s917ms|200.00M|200.05M|1.64 GB|167.89 MB|FINALIZE|
> |07:EXCHANGE|20|5s081ms|8s410ms|3.11B|200.05M|0|0|HASH(l_partkey)|
> |03:AGGREGATE|20|4m10s|5m12s|3.11B|200.05M|24.77 GB|3.28 GB| |
> |02:SCAN HDFS|20|1s544ms|2s517ms|6.00B|6.00B|838.85 MB|176.00 MB|tpch_1000_decimal_parquet.l...|
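As an aside on the original description quoted above, the suggested per-node
cardinality cap, min(agg input cardinality, NDV * #hosts), amounts to something
like the following minimal sketch (hypothetical names, not the planner's actual
API):
{code}
// Illustrative sketch of the cap suggested in the description above
// (hypothetical names, not the planner's actual API).
public class PerNodeAggCardinality {
  static long perNodeAggCardinality(
      long aggInputCardinality, long ndv, int numHosts) {
    // With randomly distributed input, each of the numHosts nodes can see up
    // to ndv distinct grouping keys, so the aggregate output across all nodes
    // is bounded by ndv * numHosts, and the whole thing can never exceed the
    // aggregation's input cardinality.
    return Math.min(aggInputCardinality, ndv * (long) numHosts);
  }
}
{code}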