Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20133 )

Change subject: IMPALA-12089: Be able to skip pushing down a subset of the 
predicates
......................................................................


Patch Set 3:

(30 comments)

Thanks for your work, Peter!

http://gerrit.cloudera.org:8080/#/c/20133/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20133/3//COMMIT_MSG@9
PS3, Line 9: This change adds a predicate filtering mechanism at planning time 
that
For me, this paragraph was a bit difficult to read. I can imagine that someone 
with no background knowledge on this optimization would find it hard to 
understand what this patch does based on this explanation. I'd split this into 
multiple paragraphs for more readability/ eg.:
- what is the current behaviour (all-or-nothing predicate filtering).
- What we mean be residual (preds returned from Iceberg after a planFiles() 
call). Also emphasize that there is pred pushdown in Iceberg and in Impala too 
and here we optimize the pushdown in Impala.
- How we want to achieve this (also a bit more structured description would be 
nice).
- What we gain here.


http://gerrit.cloudera.org:8080/#/c/20133/3//COMMIT_MSG@20
PS3, Line 20: compound
            : expressions created by Impala
example pls


http://gerrit.cloudera.org:8080/#/c/20133/3//COMMIT_MSG@19
PS3, Line 19: The visitor implementation
            : handles IN, NULL and binary expressions, and does not handle 
compound
            : expressions created by Impala, and aggregates, startsWith, isNan
            : expressions by Iceberg.
Is there a reason for not covering these? due to difficulty in implementation? 
Are there plans to add this support later on?


http://gerrit.cloudera.org:8080/#/c/20133/3//COMMIT_MSG@30
PS3, Line 30:  - planner test cases added for disabled mode
I see also coverage for the 'enabled' mode in the existing tests.


http://gerrit.cloudera.org:8080/#/c/20133/3/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/20133/3/common/thrift/ImpalaService.thrift@800
PS3, Line 800:   ICEBERG_PREDICATE_SUBSETTING = 158;
Wondering if ICEBERG_PREDICATE_PUSHDOWN_SUBSETTING would be an even more 
self-explanatory name. Might be too long, though.


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java
File fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java:

http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@45
PS3, Line 45: ImpalaExpressionVisitor
this is rather an ImpalaPredicateLocator, isn't it?


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@47
PS3, Line 47: OPERATORS
A more verbose name with be nice. Like IMPALA_ICEBERG_OPERATOR_MAP


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@56
PS3, Line 56: expressions
I think impalaExpressions would be a better name as we deal with Iceberg 
expressions here as well, so would be nice to easily differentiate between them.


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@67
PS3, Line 67:     if (literal.to(StringType.get()) == literal) {
This is a check if T is a string, right? Can't we simply call 
literal.toString() for all the cases? Wouldn't that work for String literals?


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@86
PS3, Line 86:     return result;
this seems weird, but honestly not 100% up to speed with this Iceberg visitor 
thing :)
Could you add a test to exercise this 'not' operator? I guess we'd need a bool 
partition column for this.

Giving this a second thought, with this visitor we only check the occurrence of 
certain expressions within the Iceberg expression tree. So I think I get why we 
just simply return the intermediate results for a 'not' operation as it won't 
change what expressions we have seen so far.


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@102
PS3, Line 102:   public <T> ImpalaExpressionVisitorResult 
predicate(BoundPredicate<T> pred) {
Could you help me understand what is this BoundPredicate and why it doesn't 
require such a treatment as the below UnboundPredicate does?


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@112
PS3, Line 112:     if (!(pred.term() instanceof NamedReference)) {
you could merge L112-114 into one line.


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@119
PS3, Line 119:         case NOT_NULL:
I personally find the case blocks more readable if surrounded by { } chars. I 
think there are multiple occurrences of this format in the Impala code.


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@220
PS3, Line 220:             if (new 
HashSet<>(literals).containsAll(impalaLiterals)
For performance considerations I'd check first if the sizes equal and then the 
containsAll() as it's a less expensive operation and with lazy evaluation we 
could avoid converting to HashSet and checking the content if the sizes 
mismatch.


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@225
PS3, Line 225:         case STARTS_WITH:
If these are no-op cases then no need to mention them here with the same block 
as 'default'. Unless you plan to implement them later on and the presence of 
these cases are a reminder. But in that case add a TODO pls.


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@237
PS3, Line 237:     return ImpalaExpressionVisitorResult.EMPTY;
Returning empty result here means that this Iceberg aggregate is not supported 
for Impala's predicate subsetting, right? Could you add tests that cover these?


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@263
PS3, Line 263: allMatched
locatedAllPredicates() ?


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/analysis/ImpalaExpressionVisitor.java@264
PS3, Line 264:       return icebergPredicateCount_ == locatedExpressions_.size()
Technically this should return true of icebergPredicateCount_ is zero and 
locatedExpressions_ is empty, shouldn't it?


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@121
PS3, Line 121:   private List<Expr> skippedImpalaPredicates = new ArrayList<>();
This is a subset of 'impalaIcebergPredicates', right? Could you mention this in 
the comment?


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@122
PS3, Line 122: are cannot be
cannot be


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@125
PS3, Line 125: residualExpressions
According to naming convention please add a trailing underscore to these names.


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@408
PS3, Line 408: toRetain
predicatesToRetain?


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@411
PS3, Line 411: ImpalaExpressionVisitor
This name might be misleading as it says ImpalaExpressionVisitor, however, it 
is a visitor on the Iceberg expression tree if I'm not mistaken.


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@414
PS3, Line 414:           if (result.allMatched()) {
I think it would make the code simpler if you negated the condition like this:

if (!result.allMatched()) return;
// do stuff here


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@416
PS3, Line 416:             toRetain.addAll(locatedExpressions);
This line could be merged with the above


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/test/java/org/apache/impala/analysis/ImpalaExpressionVisitorTest.java
File 
fe/src/test/java/org/apache/impala/analysis/ImpalaExpressionVisitorTest.java:

http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/test/java/org/apache/impala/analysis/ImpalaExpressionVisitorTest.java@219
PS3, Line 219:   public void testCompoundExpression() throws SqlCastException {
Would be nice to have a comment to describe the expression for better 
understanding the test case. I think this could be beneficial for the other 
test cases as well.


http://gerrit.cloudera.org:8080/#/c/20133/3/fe/src/test/java/org/apache/impala/analysis/ImpalaExpressionVisitorTest.java@248
PS3, Line 248: }
Just played with the patch and saw something weird:
This skips the predicate on 'year' col.
select * from iceberg_partition_evolution where year=2010;

while this pushes down the predicate to Impala scanner.
select * from iceberg_partition_evolution where year=null;

Do you know why this happens?


http://gerrit.cloudera.org:8080/#/c/20133/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates-disabled-subsetting.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates-disabled-subsetting.test:

http://gerrit.cloudera.org:8080/#/c/20133/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates-disabled-subsetting.test@1
PS3, Line 1: ore
typo


http://gerrit.cloudera.org:8080/#/c/20133/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates-disabled-subsetting.test@3
PS3, Line 3: select * from iceberg_partitioned where action = "download" and 
user = "Lisa";
Instead of selecting all columns, I think it also add some extra value to 
project to a subset of the columns. With this we can also test that the 
row-size changes based on what predicates we push down, assuming that the 
columns involved in the predicates aren't included in the select list.


http://gerrit.cloudera.org:8080/#/c/20133/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test:

http://gerrit.cloudera.org:8080/#/c/20133/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@1004
PS3, Line 1004: ====
Is there a test where the conversion between Iceberg residual to Impala 
predicate fails ('untranslatableImpalaPredicates' not empty) for some reason 
and Impala decides to subset the predicates being pushed down?



--
To view, visit http://gerrit.cloudera.org:8080/20133
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I597f69ad03ecaf9e304613ef934654e3d9614ae8
Gerrit-Change-Number: 20133
Gerrit-PatchSet: 3
Gerrit-Owner: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Comment-Date: Tue, 04 Jul 2023 14:12:22 +0000
Gerrit-HasComments: Yes

Reply via email to