Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/20133 )
Change subject: IMPALA-12089: Be able to skip pushing down a subset of the predicates ...................................................................... Patch Set 3: (30 comments) Thanks for your work, Peter! http://gerrit.cloudera.org:8080/#/c/20133/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20133/3//COMMIT_MSG@9 PS3, Line 9: This change adds a predicate filtering mechanism at planning time that For me, this paragraph was a bit difficult to read. I can imagine that someone with no background knowledge on this optimization would find it hard to understand what this patch does based on this explanation. I'd split this into multiple paragraphs for more readability/ eg.: - what is the current behaviour (all-or-nothing predicate filtering). - What we mean be residual (preds returned from Iceberg after a planFiles() call). Also emphasize that there is pred pushdown in Iceberg and in Impala too and here we optimize the pushdown in Impala. - How we want to achieve this (also a bit more structured description would be nice). - What we gain here. http://gerrit.cloudera.org:8080/#/c/20133/3//COMMIT_MSG@20 PS3, Line 20: compound : expressions created by Impala example pls http://gerrit.cloudera.org:8080/#/c/20133/3//COMMIT_MSG@19 PS3, Line 19: The visitor implementation : handles IN, NULL and binary expressions, and does not handle compound : expressions created by Impala, and aggregates, startsWith, isNan : expressions by Iceberg. Is there a reason for not covering these? due to difficulty in implementation? Are there plans to add this support later on? http://gerrit.cloudera.org:8080/#/c/20133/3//COMMIT_MSG@30 PS3, Line 30: - planner test cases added for disabled mode I see also coverage for the 'enabled' mode in the existing tests. http://gerrit.cloudera.org:8080/#/c/20133/3/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/20133/3/common/thrift/ImpalaService.thrift@800 PS3, Line 800: ICEBERG_PREDICATE_SUBSETTING = 158; Wondering if ICEBERG_PREDICATE_PUSHDOWN_SUBSETTING would be an even more self-explanatory name. Might be too long, though. http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java File fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java: http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@45 PS3, Line 45: ImpalaExpressionVisitor this is rather an ImpalaPredicateLocator, isn't it? http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@47 PS3, Line 47: OPERATORS A more verbose name with be nice. Like IMPALA_ICEBERG_OPERATOR_MAP http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@56 PS3, Line 56: expressions I think impalaExpressions would be a better name as we deal with Iceberg expressions here as well, so would be nice to easily differentiate between them. http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@67 PS3, Line 67: if (literal.to(StringType.get()) == literal) { This is a check if T is a string, right? Can't we simply call literal.toString() for all the cases? Wouldn't that work for String literals? http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@86 PS3, Line 86: return result; this seems weird, but honestly not 100% up to speed with this Iceberg visitor thing :) Could you add a test to exercise this 'not' operator? I guess we'd need a bool partition column for this. Giving this a second thought, with this visitor we only check the occurrence of certain expressions within the Iceberg expression tree. So I think I get why we just simply return the intermediate results for a 'not' operation as it won't change what expressions we have seen so far. http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@102 PS3, Line 102: public <T> ImpalaExpressionVisitorResult predicate(BoundPredicate<T> pred) { Could you help me understand what is this BoundPredicate and why it doesn't require such a treatment as the below UnboundPredicate does? http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@112 PS3, Line 112: if (!(pred.term() instanceof NamedReference)) { you could merge L112-114 into one line. http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@119 PS3, Line 119: case NOT_NULL: I personally find the case blocks more readable if surrounded by { } chars. I think there are multiple occurrences of this format in the Impala code. http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@220 PS3, Line 220: if (new HashSet<>(literals).containsAll(impalaLiterals) For performance considerations I'd check first if the sizes equal and then the containsAll() as it's a less expensive operation and with lazy evaluation we could avoid converting to HashSet and checking the content if the sizes mismatch. http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@225 PS3, Line 225: case STARTS_WITH: If these are no-op cases then no need to mention them here with the same block as 'default'. Unless you plan to implement them later on and the presence of these cases are a reminder. But in that case add a TODO pls. http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@237 PS3, Line 237: return ImpalaExpressionVisitorResult.EMPTY; Returning empty result here means that this Iceberg aggregate is not supported for Impala's predicate subsetting, right? Could you add tests that cover these? http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@263 PS3, Line 263: allMatched locatedAllPredicates() ? http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@264 PS3, Line 264: return icebergPredicateCount_ == locatedExpressions_.size() Technically this should return true of icebergPredicateCount_ is zero and locatedExpressions_ is empty, shouldn't it? http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java: http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@121 PS3, Line 121: private List<Expr> skippedImpalaPredicates = new ArrayList<>(); This is a subset of 'impalaIcebergPredicates', right? Could you mention this in the comment? http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@122 PS3, Line 122: are cannot be cannot be http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@125 PS3, Line 125: residualExpressions According to naming convention please add a trailing underscore to these names. http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@408 PS3, Line 408: toRetain predicatesToRetain? http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@411 PS3, Line 411: ImpalaExpressionVisitor This name might be misleading as it says ImpalaExpressionVisitor, however, it is a visitor on the Iceberg expression tree if I'm not mistaken. http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@414 PS3, Line 414: if (result.allMatched()) { I think it would make the code simpler if you negated the condition like this: if (!result.allMatched()) return; // do stuff here http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@416 PS3, Line 416: toRetain.addAll(locatedExpressions); This line could be merged with the above http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/test/java/org/apache/impala/analysis/ImpalaExpressionVisitorTest.java File fe/src/test/java/org/apache/impala/analysis/ImpalaExpressionVisitorTest.java: http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/test/java/org/apache/impala/analysis/ImpalaExpressionVisitorTest.java@219 PS3, Line 219: public void testCompoundExpression() throws SqlCastException { Would be nice to have a comment to describe the expression for better understanding the test case. I think this could be beneficial for the other test cases as well. http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/test/java/org/apache/impala/analysis/ImpalaExpressionVisitorTest.java@248 PS3, Line 248: } Just played with the patch and saw something weird: This skips the predicate on 'year' col. select * from iceberg_partition_evolution where year=2010; while this pushes down the predicate to Impala scanner. select * from iceberg_partition_evolution where year=null; Do you know why this happens? http://gerrit.cloudera.org:8080/#/c/20133/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates-disabled-subsetting.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates-disabled-subsetting.test: http://gerrit.cloudera.org:8080/#/c/20133/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates-disabled-subsetting.test@1 PS3, Line 1: ore typo http://gerrit.cloudera.org:8080/#/c/20133/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates-disabled-subsetting.test@3 PS3, Line 3: select * from iceberg_partitioned where action = "download" and user = "Lisa"; Instead of selecting all columns, I think it also add some extra value to project to a subset of the columns. With this we can also test that the row-size changes based on what predicates we push down, assuming that the columns involved in the predicates aren't included in the select list. http://gerrit.cloudera.org:8080/#/c/20133/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test: http://gerrit.cloudera.org:8080/#/c/20133/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@1004 PS3, Line 1004: ==== Is there a test where the conversion between Iceberg residual to Impala predicate fails ('untranslatableImpalaPredicates' not empty) for some reason and Impala decides to subset the predicates being pushed down? -- To view, visit http://gerrit.cloudera.org:8080/20133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I597f69ad03ecaf9e304613ef934654e3d9614ae8 Gerrit-Change-Number: 20133 Gerrit-PatchSet: 3 Gerrit-Owner: Peter Rozsa <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Comment-Date: Tue, 04 Jul 2023 14:12:22 +0000 Gerrit-HasComments: Yes
