[
https://issues.apache.org/jira/browse/IMPALA-13077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17869994#comment-17869994
]
ASF subversion and git services commented on IMPALA-13077:
----------------------------------------------------------
Commit 84484823cb476bc472027ccd9b829b557824ac34 in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=84484823c ]
IMPALA-13077: Fix selectivity estimation for SEMI JOIN
JoinNode.getSemiJoinCardinality() will skip an equality expression if
either NDV or Cardinality of equality expression is unknown (-1). This
patch fix the unknown NDV issue by making JoinNode.getNdv() wraps around
ColumnStats.fromExpr().
Testing:
- Add test case where LEFT SEMI JOIN from subquery can reduce
cardinality estimate of leftmost ScanNode in the query plan.
- Add new pattern at CARDINALITY_FILTER to ignore reduction by runtime
filter.
- Pass core tests.
Change-Id: I9c799df535d764c3f87ededef1c48eaa103293a0
Reviewed-on: http://gerrit.cloudera.org:8080/21516
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Equality predicate on partition column and uncorrelated subquery doesn't
> reduce the cardinality estimate
> --------------------------------------------------------------------------------------------------------
>
> Key: IMPALA-13077
> URL: https://issues.apache.org/jira/browse/IMPALA-13077
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Quanlong Huang
> Assignee: Riza Suminto
> Priority: Critical
>
> Let's say 'part_tbl' is a partitioned table. Its partition key is 'part_key'.
> Consider the following query:
> {code:sql}
> select xxx from part_tbl
> where part_key=(select ... from dim_tbl);
> {code}
> Its query plan is a JoinNode with two ScanNodes. When estimating the
> cardinality of the JoinNode, the planner is not aware that 'part_key' is the
> partition column and the cardinality of the JoinNode should not be larger
> than the max row count across partitions.
> The recent work in IMPALA-12018 (Consider runtime filter for cardinality
> reduction) helps in some cases since there are runtime filters on the
> partition column. But there are still some cases that we overestimate the
> cardinality. For instance, 'ss_sold_date_sk' is the only partition key of
> tpcds.store_sales. The following query
> {code:sql}
> select count(*) from tpcds.store_sales
> where ss_sold_date_sk=(
> select min(d_date_sk) + 1000 from tpcds.date_dim);{code}
> has query plan:
> {noformat}
> +-------------------------------------------------------------+
> | Explain String |
> +-------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=18.94MB Threads=6 |
> | Per-Host Resource Estimates: Memory=243MB |
> | |
> | PLAN-ROOT SINK |
> | | |
> | 09:AGGREGATE [FINALIZE] |
> | | output: count:merge(*) |
> | | row-size=8B cardinality=1 |
> | | |
> | 08:EXCHANGE [UNPARTITIONED] |
> | | |
> | 04:AGGREGATE |
> | | output: count(*) |
> | | row-size=8B cardinality=1 |
> | | |
> | 03:HASH JOIN [LEFT SEMI JOIN, BROADCAST] |
> | | hash predicates: ss_sold_date_sk = min(d_date_sk) + 1000 |
> | | runtime filters: RF000 <- min(d_date_sk) + 1000 |
> | | row-size=4B cardinality=2.88M <---- Should be max(numRows) across
> partitions
> | | |
> | |--07:EXCHANGE [BROADCAST] |
> | | | |
> | | 06:AGGREGATE [FINALIZE] |
> | | | output: min:merge(d_date_sk) |
> | | | row-size=4B cardinality=1 |
> | | | |
> | | 05:EXCHANGE [UNPARTITIONED] |
> | | | |
> | | 02:AGGREGATE |
> | | | output: min(d_date_sk) |
> | | | row-size=4B cardinality=1 |
> | | | |
> | | 01:SCAN HDFS [tpcds.date_dim] |
> | | HDFS partitions=1/1 files=1 size=9.84MB |
> | | row-size=4B cardinality=73.05K |
> | | |
> | 00:SCAN HDFS [tpcds.store_sales] |
> | HDFS partitions=1824/1824 files=1824 size=346.60MB |
> | runtime filters: RF000 -> ss_sold_date_sk |
> | row-size=4B cardinality=2.88M |
> +-------------------------------------------------------------+{noformat}
> CC [~boroknagyz], [~rizaon]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]