Aman Sinha created IMPALA-12168:
-----------------------------------
Summary: Aggregate's cardinality overestimated for a count
distinct query with grouping on same column
Key: IMPALA-12168
URL: https://issues.apache.org/jira/browse/IMPALA-12168
Project: IMPALA
Issue Type: Bug
Components: Frontend
Reporter: Aman Sinha
When the COUNT(DISTINCT col) is on the same column as the GROUP BY, we
currently overestimate the cardinality:
{noformat}
[localhost:21050] tpch> explain select l_shipdate, l_quantity, count(distinct
l_shipdate) from lineitem group by l_shipdate, l_quantity;
Query: explain select l_shipdate, l_quantity, count(distinct l_shipdate) from
lineitem group by l_shipdate, l_quantity
+-------------------------------------------------------------+
| Explain String |
+-------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=89.75MB Threads=5 |
| Per-Host Resource Estimates: Memory=517MB |
| |
| PLAN-ROOT SINK |
| | |
| 07:EXCHANGE [UNPARTITIONED] |
| | |
| 06:AGGREGATE [FINALIZE] |
| | output: count:merge(l_shipdate) |
| | group by: l_shipdate, l_quantity |
| | row-size=38B cardinality=134.08K |
| | |
| 05:EXCHANGE [HASH(l_shipdate,l_quantity)] |
| | |
| 02:AGGREGATE [STREAMING] |
| | output: count(l_shipdate) |
| | group by: l_shipdate, l_quantity |
| | row-size=38B cardinality=134.08K |
| | |
| 04:AGGREGATE |
| | group by: l_shipdate, l_quantity, l_shipdate |
| | row-size=52B cardinality=6.00M |
| | |
| 03:EXCHANGE [HASH(l_shipdate,l_quantity,l_shipdate)] |
| | |
| 01:AGGREGATE [STREAMING] |
| | group by: l_shipdate, l_quantity, l_shipdate |
| | row-size=52B cardinality=6.00M |
| | |
| 00:SCAN HDFS [tpch.lineitem] |
| HDFS partitions=1/1 files=1 size=718.94MB |
| row-size=30B cardinality=6.00M |
+-------------------------------------------------------------+
{noformat}
Here the lower Streaming Agg's cardinality is 6M rows whereas the actual
rowcount for that phase is 376K:
{noformat}
01:AGGREGATE 3 3 327.441ms 350.542ms 376.80K 6.00M
36.11 MB 91.57 MB STREAMING
{noformat}
The NDV column stats for this table:
{noformat}
[localhost:21050] tpch> show column stats lineitem;
Query: show column stats lineitem
+-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg
Size | #Trues | #Falses |
+-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
| l_orderkey | BIGINT | 1563438 | 0 | 8 | 8.0
| -1 | -1 |
| l_partkey | BIGINT | 200516 | 0 | 8 | 8.0
| -1 | -1 |
| l_suppkey | BIGINT | 9712 | 0 | 8 | 8.0
| -1 | -1 |
| l_linenumber | INT | 7 | 0 | 4 | 4.0
| -1 | -1 |
| l_quantity | DECIMAL(12,2) | 51 | 0 | 8 | 8.0
| -1 | -1 |
| l_extendedprice | DECIMAL(12,2) | 868550 | 0 | 8 | 8.0
| -1 | -1 |
| l_discount | DECIMAL(12,2) | 11 | 0 | 8 | 8.0
| -1 | -1 |
| l_tax | DECIMAL(12,2) | 9 | 0 | 8 | 8.0
| -1 | -1 |
| l_returnflag | STRING | 3 | 0 | 1 | 1.0
| -1 | -1 |
| l_linestatus | STRING | 2 | 0 | 1 | 1.0
| -1 | -1 |
| l_shipdate | STRING | 2629 | 0 | 10 | 10.0
| -1 | -1 |
| l_commitdate | STRING | 2559 | 0 | 10 | 10.0
| -1 | -1 |
| l_receiptdate | STRING | 2658 | 0 | 10 | 10.0
| -1 | -1 |
| l_shipinstruct | STRING | 4 | 0 | 17 |
11.9986381531 | -1 | -1 |
| l_shipmode | STRING | 7 | 0 | 7 |
4.28530454636 | -1 | -1 |
| l_comment | STRING | 4652621 | 0 | 43 |
26.4941692352 | -1 | -1 |
+-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
{noformat}
The reason for this overestimation is the group by exprs contain duplicate
column l_shipdate:
{noformat}
group by: l_shipdate, l_quantity, l_shipdate
{noformat}
Due to this extra duplicate column, we compute the cardinality as
NDV(l_shipdate) * NDV(l_quantity) * NDV(l_shipdate) = 352M which is then
capped at 6M since the child produces 6M rows.
The right estimate should be NDV(l_shipdate) * NDV(l_quantity) = 134K
--
This message was sent by Atlassian Jira
(v8.20.10#820010)