[
https://issues.apache.org/jira/browse/IMPALA-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aman Sinha reassigned IMPALA-12168:
-----------------------------------
Assignee: Aman Sinha
> Aggregate's cardinality overestimated for a count distinct query with
> grouping on same column
> ---------------------------------------------------------------------------------------------
>
> Key: IMPALA-12168
> URL: https://issues.apache.org/jira/browse/IMPALA-12168
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Priority: Major
>
> When the COUNT(DISTINCT col) is on the same column as the GROUP BY, we
> currently overestimate the cardinality:
> {noformat}
> [localhost:21050] tpch> explain select l_shipdate, l_quantity, count(distinct
> l_shipdate) from lineitem group by l_shipdate, l_quantity;
> Query: explain select l_shipdate, l_quantity, count(distinct l_shipdate) from
> lineitem group by l_shipdate, l_quantity
> +-------------------------------------------------------------+
> | Explain String |
> +-------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=89.75MB Threads=5 |
> | Per-Host Resource Estimates: Memory=517MB |
> | |
> | PLAN-ROOT SINK |
> | | |
> | 07:EXCHANGE [UNPARTITIONED] |
> | | |
> | 06:AGGREGATE [FINALIZE] |
> | | output: count:merge(l_shipdate) |
> | | group by: l_shipdate, l_quantity |
> | | row-size=38B cardinality=134.08K |
> | | |
> | 05:EXCHANGE [HASH(l_shipdate,l_quantity)] |
> | | |
> | 02:AGGREGATE [STREAMING] |
> | | output: count(l_shipdate) |
> | | group by: l_shipdate, l_quantity |
> | | row-size=38B cardinality=134.08K |
> | | |
> | 04:AGGREGATE |
> | | group by: l_shipdate, l_quantity, l_shipdate |
> | | row-size=52B cardinality=6.00M |
> | | |
> | 03:EXCHANGE [HASH(l_shipdate,l_quantity,l_shipdate)] |
> | | |
> | 01:AGGREGATE [STREAMING] |
> | | group by: l_shipdate, l_quantity, l_shipdate |
> | | row-size=52B cardinality=6.00M |
> | | |
> | 00:SCAN HDFS [tpch.lineitem] |
> | HDFS partitions=1/1 files=1 size=718.94MB |
> | row-size=30B cardinality=6.00M |
> +-------------------------------------------------------------+
> {noformat}
> Here the lower Streaming Agg's cardinality is 6M rows whereas the actual
> rowcount for that phase is 376K:
> {noformat}
> 01:AGGREGATE 3 3 327.441ms 350.542ms 376.80K
> 6.00M 36.11 MB 91.57 MB STREAMING
> {noformat}
> The NDV column stats for this table:
> {noformat}
> [localhost:21050] tpch> show column stats lineitem;
> Query: show column stats lineitem
> +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
> | Column | Type | #Distinct Values | #Nulls | Max Size |
> Avg Size | #Trues | #Falses |
> +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
> | l_orderkey | BIGINT | 1563438 | 0 | 8 |
> 8.0 | -1 | -1 |
> | l_partkey | BIGINT | 200516 | 0 | 8 |
> 8.0 | -1 | -1 |
> | l_suppkey | BIGINT | 9712 | 0 | 8 |
> 8.0 | -1 | -1 |
> | l_linenumber | INT | 7 | 0 | 4 |
> 4.0 | -1 | -1 |
> | l_quantity | DECIMAL(12,2) | 51 | 0 | 8 |
> 8.0 | -1 | -1 |
> | l_extendedprice | DECIMAL(12,2) | 868550 | 0 | 8 |
> 8.0 | -1 | -1 |
> | l_discount | DECIMAL(12,2) | 11 | 0 | 8 |
> 8.0 | -1 | -1 |
> | l_tax | DECIMAL(12,2) | 9 | 0 | 8 |
> 8.0 | -1 | -1 |
> | l_returnflag | STRING | 3 | 0 | 1 |
> 1.0 | -1 | -1 |
> | l_linestatus | STRING | 2 | 0 | 1 |
> 1.0 | -1 | -1 |
> | l_shipdate | STRING | 2629 | 0 | 10 |
> 10.0 | -1 | -1 |
> | l_commitdate | STRING | 2559 | 0 | 10 |
> 10.0 | -1 | -1 |
> | l_receiptdate | STRING | 2658 | 0 | 10 |
> 10.0 | -1 | -1 |
> | l_shipinstruct | STRING | 4 | 0 | 17 |
> 11.9986381531 | -1 | -1 |
> | l_shipmode | STRING | 7 | 0 | 7 |
> 4.28530454636 | -1 | -1 |
> | l_comment | STRING | 4652621 | 0 | 43 |
> 26.4941692352 | -1 | -1 |
> +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
> {noformat}
> The reason for this overestimation is the group by exprs contain duplicate
> column l_shipdate:
> {noformat}
> group by: l_shipdate, l_quantity, l_shipdate
> {noformat}
> Due to this extra duplicate column, we compute the cardinality as
> NDV(l_shipdate) * NDV(l_quantity) * NDV(l_shipdate) = 352M which is then
> capped at 6M since the child produces 6M rows.
> The right estimate should be NDV(l_shipdate) * NDV(l_quantity) = 134K
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]