Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20498 )

Change subject: IMPALA-12018: Consider runtime filter for cardinality reduction
......................................................................


Patch Set 3:

(3 comments)

The implementation looks good to me, but i have some high level questions.

http://gerrit.cloudera.org:8080/#/c/20498/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20498/3//COMMIT_MSG@15
PS3, Line 15: This patch applies runtime filter selectivity to lower cardinality
            : estimates of scan nodes and certain join nodes above them after 
runtime
            : filter generation and before resource requirement computation.
I think that ideally we shouldn't use join node selectivity, but the NDV of the 
key on the build side. Generally we assume that selectivity on the join node 
will also reduce the NDV of the key, so the two becomes interchangeable, but I 
would prefer if this logic would stay within join node/runtime filter code, and 
not spread to other nodes.

This is how I see the steps that lead to cardinality reduction in the scan node:
1. ndv(key) is calculated in the join builder based on build side columns stats 
and estimated selectivity
2. a runtime filter is created for key, which also has an fpp based on ndv(key) 
and bloom filter size
3. the scan node can calculate bloom filter selectivity based on its own ndv 
(after applying other predicates), the ndv from the bloom filter and fpp of the 
bloom filter

This probably gives the same results as the current solution, but makes it 
easier to think about the effect of other predicates / runtime filters / 
duplicate keys on build side.

A good example where this could help is a join with huge number of duplicates + 
a very selective filter on the build side. We could assume that the build side 
ndv(key) is reduced due to the selective predicate, but 
getJoinNodeSelectivity() would be still > 0 due to the duplicate matches. So we 
wouldn't reduce scanner cardinality while we could assume that many rows are 
dropped based on the bloom filter.


http://gerrit.cloudera.org:8080/#/c/20498/3//COMMIT_MSG@34
PS3, Line 34: Testing:
It would be nice to have some targeted tests for edge cases, e.g. missing 
stats, duplicate matches in join, effect of other predicates.


http://gerrit.cloudera.org:8080/#/c/20498/3/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

http://gerrit.cloudera.org:8080/#/c/20498/3/fe/src/main/java/org/apache/impala/planner/Planner.java@538
PS3, Line 538: CardinalityRefinerVisitor
I don't have a clear plan, but couldn't this be turned into a recursive 
function in PlanNode?



--
To view, visit http://gerrit.cloudera.org:8080/20498
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I033789c9b63a8188484e3afde8e646563918b3e1
Gerrit-Change-Number: 20498
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: David Rorke <dro...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Oct 2023 09:17:30 +0000
Gerrit-HasComments: Yes

Reply via email to