Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/20498 )
Change subject: IMPALA-12018: Consider runtime filter for cardinality reduction ...................................................................... Patch Set 3: (3 comments) The implementation looks good to me, but i have some high level questions. http://gerrit.cloudera.org:8080/#/c/20498/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20498/3//COMMIT_MSG@15 PS3, Line 15: This patch applies runtime filter selectivity to lower cardinality : estimates of scan nodes and certain join nodes above them after runtime : filter generation and before resource requirement computation. I think that ideally we shouldn't use join node selectivity, but the NDV of the key on the build side. Generally we assume that selectivity on the join node will also reduce the NDV of the key, so the two becomes interchangeable, but I would prefer if this logic would stay within join node/runtime filter code, and not spread to other nodes. This is how I see the steps that lead to cardinality reduction in the scan node: 1. ndv(key) is calculated in the join builder based on build side columns stats and estimated selectivity 2. a runtime filter is created for key, which also has an fpp based on ndv(key) and bloom filter size 3. the scan node can calculate bloom filter selectivity based on its own ndv (after applying other predicates), the ndv from the bloom filter and fpp of the bloom filter This probably gives the same results as the current solution, but makes it easier to think about the effect of other predicates / runtime filters / duplicate keys on build side. A good example where this could help is a join with huge number of duplicates + a very selective filter on the build side. We could assume that the build side ndv(key) is reduced due to the selective predicate, but getJoinNodeSelectivity() would be still > 0 due to the duplicate matches. So we wouldn't reduce scanner cardinality while we could assume that many rows are dropped based on the bloom filter. http://gerrit.cloudera.org:8080/#/c/20498/3//COMMIT_MSG@34 PS3, Line 34: Testing: It would be nice to have some targeted tests for edge cases, e.g. missing stats, duplicate matches in join, effect of other predicates. http://gerrit.cloudera.org:8080/#/c/20498/3/fe/src/main/java/org/apache/impala/planner/Planner.java File fe/src/main/java/org/apache/impala/planner/Planner.java: http://gerrit.cloudera.org:8080/#/c/20498/3/fe/src/main/java/org/apache/impala/planner/Planner.java@538 PS3, Line 538: CardinalityRefinerVisitor I don't have a clear plan, but couldn't this be turned into a recursive function in PlanNode? -- To view, visit http://gerrit.cloudera.org:8080/20498 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I033789c9b63a8188484e3afde8e646563918b3e1 Gerrit-Change-Number: 20498 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: David Rorke <dro...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Comment-Date: Tue, 24 Oct 2023 09:17:30 +0000 Gerrit-HasComments: Yes