[jira] [Resolved] (IMPALA-13077) Equality predicate on partition column and uncorrelated subquery doesn't reduce the cardinality estimate

Riza Suminto (Jira) Thu, 01 Aug 2024 15:32:07 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-13077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Riza Suminto resolved IMPALA-13077.
-----------------------------------
     Fix Version/s: Impala 4.5.0
    Target Version: Impala 4.5.0
        Resolution: Fixed

It would be great if we could use max(numRows) across all partitions for this 
subquery case. While not exactly the same, setting the correct NDV through 
ColumnStats.fromExpr() seems to correct the cardinality estimate to some 
degree. Therefore, I'm resolving this Jira as Fixed.

> Equality predicate on partition column and uncorrelated subquery doesn't 
> reduce the cardinality estimate
> --------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-13077
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13077
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Quanlong Huang
>            Assignee: Riza Suminto
>            Priority: Critical
>             Fix For: Impala 4.5.0
>
>
> Let's say 'part_tbl' is a partitioned table. Its partition key is 'part_key'. 
> Consider the following query:
> {code:sql}
> select xxx from part_tbl
> where part_key=(select ... from dim_tbl);
> {code}
> Its query plan is a JoinNode with two ScanNodes. When estimating the 
> cardinality of the JoinNode, the planner is not aware that 'part_key' is the 
> partition column and the cardinality of the JoinNode should not be larger 
> than the max row count across partitions.
> The recent work in IMPALA-12018 (Consider runtime filter for cardinality 
> reduction) helps in some cases since there are runtime filters on the 
> partition column. But there are still some cases where we overestimate the 
> cardinality. For instance, 'ss_sold_date_sk' is the only partition key of 
> tpcds.store_sales. The following query
> {code:sql}
> select count(*) from tpcds.store_sales
> where ss_sold_date_sk=(
>   select min(d_date_sk) + 1000 from tpcds.date_dim);{code}
> has query plan:
> {noformat}
> +-------------------------------------------------------------+
> | Explain String                                              |
> +-------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=18.94MB Threads=6 |
> | Per-Host Resource Estimates: Memory=243MB                   |
> |                                                             |
> | PLAN-ROOT SINK                                              |
> | |                                                           |
> | 09:AGGREGATE [FINALIZE]                                     |
> | |  output: count:merge(*)                                   |
> | |  row-size=8B cardinality=1                                |
> | |                                                           |
> | 08:EXCHANGE [UNPARTITIONED]                                 |
> | |                                                           |
> | 04:AGGREGATE                                                |
> | |  output: count(*)                                         |
> | |  row-size=8B cardinality=1                                |
> | |                                                           |
> | 03:HASH JOIN [LEFT SEMI JOIN, BROADCAST]                    |
> | |  hash predicates: ss_sold_date_sk = min(d_date_sk) + 1000 |
> | |  runtime filters: RF000 <- min(d_date_sk) + 1000          |
> | |  row-size=4B cardinality=2.88M <---- Should be max(numRows) across partitions
> | |                                                           |
> | |--07:EXCHANGE [BROADCAST]                                  |
> | |  |                                                        |
> | |  06:AGGREGATE [FINALIZE]                                  |
> | |  |  output: min:merge(d_date_sk)                          |
> | |  |  row-size=4B cardinality=1                             |
> | |  |                                                        |
> | |  05:EXCHANGE [UNPARTITIONED]                              |
> | |  |                                                        |
> | |  02:AGGREGATE                                             |
> | |  |  output: min(d_date_sk)                                |
> | |  |  row-size=4B cardinality=1                             |
> | |  |                                                        |
> | |  01:SCAN HDFS [tpcds.date_dim]                            |
> | |     HDFS partitions=1/1 files=1 size=9.84MB               |
> | |     row-size=4B cardinality=73.05K                        |
> | |                                                           |
> | 00:SCAN HDFS [tpcds.store_sales]                            |
> |    HDFS partitions=1824/1824 files=1824 size=346.60MB       |
> |    runtime filters: RF000 -> ss_sold_date_sk                |
> |    row-size=4B cardinality=2.88M                            |
> +-------------------------------------------------------------+{noformat}
> CC [~boroknagyz], [~rizaon]
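
For illustration only, the two bounds discussed in this issue can be sketched 
in Python. This is not Impala's planner code: the helper names 
cap_join_cardinality and ndv_based_estimate are hypothetical, and the 
per-partition row counts in the usage lines are made up. Only the 2.88M row 
estimate and the 1824 partitions come from the plan above. The sketch assumes 
the subquery returns a single row, so at most one partition can match.

```python
def cap_join_cardinality(join_estimate, partition_row_counts):
    """Upper-bound a join estimate by the largest partition's row count.

    Hypothetical helper: only meaningful when the probe side is filtered by
    `part_key = <single-row uncorrelated subquery>`, so the join output
    cannot exceed the rows of one partition.
    """
    if not partition_row_counts:
        return 0
    return min(join_estimate, max(partition_row_counts))


def ndv_based_estimate(total_rows, part_key_ndv):
    """Estimate rows matching one key value, assuming a uniform distribution."""
    if part_key_ndv <= 0:
        # No usable NDV: fall back to the unreduced row count.
        return total_rows
    return max(1, round(total_rows / part_key_ndv))


# Made-up per-partition row counts; the cap is the largest of them.
print(cap_join_cardinality(2_880_000, [1000, 2500, 1800]))  # 2500
# 2.88M rows spread over 1824 distinct partition key values.
print(ndv_based_estimate(2_880_000, 1824))  # 1579
```

The first helper corresponds to the max(numRows) bound requested in the 
description. The second is a simplified view of how an accurate NDV on the 
partition column can shrink an equality estimate, which is the kind of 
effect the resolution comment attributes to ColumnStats.fromExpr().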



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
