[jira] [Created] (IMPALA-13077) Equality predicate on partition column and uncorrelated subquery doesn't reduce the cardinality estimate

Quanlong Huang (Jira) Mon, 13 May 2024 18:57:07 -0700

Quanlong Huang created IMPALA-13077:
---------------------------------------


             Summary: Equality predicate on partition column and uncorrelated 
subquery doesn't reduce the cardinality estimate
                 Key: IMPALA-13077
                 URL: https://issues.apache.org/jira/browse/IMPALA-13077
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
            Reporter: Quanlong Huang


Let's say 'part_tbl' is a partitioned table. Its partition key is 'part_key'. 
Consider the following query:
{code:sql}
select xxx from part_tbl
where part_key=(select ... from dim_tbl);
{code}
Its query plan is a JoinNode with two ScanNodes. When estimating the 
cardinality of the JoinNode, the planner is not aware that 'part_key' is the 
partition column and the cardinality of the JoinNode should not be larger than 
the max row count across partitions.

The recent work in IMPALA-12018 (Consider runtime filter for cardinality 
reduction) helps in some cases since there are runtime filters on the partition 
column. But there are still some cases that we overestimate the cardinality. 
For instance, 'ss_sold_date_sk' is the only partition key of tpcds.store_sales. 
The following query
{code:sql}
select count(*) from tpcds.store_sales
where ss_sold_date_sk=(
  select min(d_date_sk) + 1000 from tpcds.date_dim);{code}
has query plan:
{noformat}
+-------------------------------------------------------------+
| Explain String                                              |
+-------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=18.94MB Threads=6 |
| Per-Host Resource Estimates: Memory=243MB                   |
|                                                             |
| PLAN-ROOT SINK                                              |
| |                                                           |
| 09:AGGREGATE [FINALIZE]                                     |
| |  output: count:merge(*)                                   |
| |  row-size=8B cardinality=1                                |
| |                                                           |
| 08:EXCHANGE [UNPARTITIONED]                                 |
| |                                                           |
| 04:AGGREGATE                                                |
| |  output: count(*)                                         |
| |  row-size=8B cardinality=1                                |
| |                                                           |
| 03:HASH JOIN [LEFT SEMI JOIN, BROADCAST]                    |
| |  hash predicates: ss_sold_date_sk = min(d_date_sk) + 1000 |
| |  runtime filters: RF000 <- min(d_date_sk) + 1000          |
| |  row-size=4B cardinality=2.88M <---- Should be max(numRows) across 
partitions
| |                                                           |
| |--07:EXCHANGE [BROADCAST]                                  |
| |  |                                                        |
| |  06:AGGREGATE [FINALIZE]                                  |
| |  |  output: min:merge(d_date_sk)                          |
| |  |  row-size=4B cardinality=1                             |
| |  |                                                        |
| |  05:EXCHANGE [UNPARTITIONED]                              |
| |  |                                                        |
| |  02:AGGREGATE                                             |
| |  |  output: min(d_date_sk)                                |
| |  |  row-size=4B cardinality=1                             |
| |  |                                                        |
| |  01:SCAN HDFS [tpcds.date_dim]                            |
| |     HDFS partitions=1/1 files=1 size=9.84MB               |
| |     row-size=4B cardinality=73.05K                        |
| |                                                           |
| 00:SCAN HDFS [tpcds.store_sales]                            |
|    HDFS partitions=1824/1824 files=1824 size=346.60MB       |
|    runtime filters: RF000 -> ss_sold_date_sk                |
|    row-size=4B cardinality=2.88M                            |
+-------------------------------------------------------------+{noformat}
CC [~boroknagyz], [~rizaon]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (IMPALA-13077) Equality predicate on partition column and uncorrelated subquery doesn't reduce the cardinality estimate

Reply via email to