[ 
https://issues.apache.org/jira/browse/IMPALA-13077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853842#comment-17853842
 ] 

Riza Suminto commented on IMPALA-13077:
---------------------------------------

Looks like this is a bug in terms of calculating lhsNdv and rhsNdv. In current 
code, if either Ndv or Cardinality of equality expression is unknown (-1), 
getSemiJoinCardinality will skip that expression.

[https://github.com/apache/impala/blob/e7dac008bbafb20e4c7d15d46fbbbb2bac9a757f/fe/src/main/java/org/apache/impala/planner/JoinNode.java#L720-L726]
 

If Ndv is unknown, but Cardinality is known, that code should assume 
Cardinality as Ndv instead. I test that hack and confirmed that it lower the 
join cardinality through LOG.
{code:java}
I0610 17:09:25.739796 3972670 JoinNode.java:719] 
774dd75ed2b1fc53:c78b86b200000000] eqJoinConjuncts_.size=1
I0610 17:09:25.739863 3972670 JoinNode.java:755] 
774dd75ed2b1fc53:c78b86b200000000] getSemiJoinCardinality calculate selectivity 
for (ss_sold_date_sk = min(d_date_sk)) as 5.482456140350877E-4
I0610 17:09:25.739918 3972670 JoinNode.java:760] 
774dd75ed2b1fc53:c78b86b200000000] getSemiJoinCardinality has 
minSelectivity=5.482456140350877E-4
I0610 17:09:25.739933 3972670 JoinNode.java:762] 
774dd75ed2b1fc53:c78b86b200000000] Changed cardinality from 2880404 to 1579
I0610 17:09:25.739966 3972670 JoinNode.java:866] 
774dd75ed2b1fc53:c78b86b200000000] stats Join: cardinality=1579{code}

> Equality predicate on partition column and uncorrelated subquery doesn't 
> reduce the cardinality estimate
> --------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-13077
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13077
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>
> Let's say 'part_tbl' is a partitioned table. Its partition key is 'part_key'. 
> Consider the following query:
> {code:sql}
> select xxx from part_tbl
> where part_key=(select ... from dim_tbl);
> {code}
> Its query plan is a JoinNode with two ScanNodes. When estimating the 
> cardinality of the JoinNode, the planner is not aware that 'part_key' is the 
> partition column and the cardinality of the JoinNode should not be larger 
> than the max row count across partitions.
> The recent work in IMPALA-12018 (Consider runtime filter for cardinality 
> reduction) helps in some cases since there are runtime filters on the 
> partition column. But there are still some cases that we overestimate the 
> cardinality. For instance, 'ss_sold_date_sk' is the only partition key of 
> tpcds.store_sales. The following query
> {code:sql}
> select count(*) from tpcds.store_sales
> where ss_sold_date_sk=(
>   select min(d_date_sk) + 1000 from tpcds.date_dim);{code}
> has query plan:
> {noformat}
> +-------------------------------------------------------------+
> | Explain String                                              |
> +-------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=18.94MB Threads=6 |
> | Per-Host Resource Estimates: Memory=243MB                   |
> |                                                             |
> | PLAN-ROOT SINK                                              |
> | |                                                           |
> | 09:AGGREGATE [FINALIZE]                                     |
> | |  output: count:merge(*)                                   |
> | |  row-size=8B cardinality=1                                |
> | |                                                           |
> | 08:EXCHANGE [UNPARTITIONED]                                 |
> | |                                                           |
> | 04:AGGREGATE                                                |
> | |  output: count(*)                                         |
> | |  row-size=8B cardinality=1                                |
> | |                                                           |
> | 03:HASH JOIN [LEFT SEMI JOIN, BROADCAST]                    |
> | |  hash predicates: ss_sold_date_sk = min(d_date_sk) + 1000 |
> | |  runtime filters: RF000 <- min(d_date_sk) + 1000          |
> | |  row-size=4B cardinality=2.88M <---- Should be max(numRows) across 
> partitions
> | |                                                           |
> | |--07:EXCHANGE [BROADCAST]                                  |
> | |  |                                                        |
> | |  06:AGGREGATE [FINALIZE]                                  |
> | |  |  output: min:merge(d_date_sk)                          |
> | |  |  row-size=4B cardinality=1                             |
> | |  |                                                        |
> | |  05:EXCHANGE [UNPARTITIONED]                              |
> | |  |                                                        |
> | |  02:AGGREGATE                                             |
> | |  |  output: min(d_date_sk)                                |
> | |  |  row-size=4B cardinality=1                             |
> | |  |                                                        |
> | |  01:SCAN HDFS [tpcds.date_dim]                            |
> | |     HDFS partitions=1/1 files=1 size=9.84MB               |
> | |     row-size=4B cardinality=73.05K                        |
> | |                                                           |
> | 00:SCAN HDFS [tpcds.store_sales]                            |
> |    HDFS partitions=1824/1824 files=1824 size=346.60MB       |
> |    runtime filters: RF000 -> ss_sold_date_sk                |
> |    row-size=4B cardinality=2.88M                            |
> +-------------------------------------------------------------+{noformat}
> CC [~boroknagyz], [~rizaon]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to