Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/24153 )

Change subject: IMPALA-14597: Initial HBO support
......................................................................


Patch Set 15:

(13 comments)

> Patch Set 13:
>
> (10 comments)
>
> I started doing a second pass code review and I saw you responded to a first 
> pass.
>
> So this post is a bit of a mix between responding to what was there and some 
> new comments.
>
> Still need to look at this more today to understand the complete flow.

Thanks for the review! I've done some refactors. Hope the code is cleaner.

http://gerrit.cloudera.org:8080/#/c/24153/13/common/thrift/HBO.thrift
File common/thrift/HBO.thrift:

http://gerrit.cloudera.org:8080/#/c/24153/13/common/thrift/HBO.thrift@47
PS13, Line 47:   3: optional i64 num_input_files
> Maybe not for this release, but something for the future:
That's a good point. It seems thrift will always duplicate the records even if 
they are the same instance in Java. I'll double check this. Currently 
TScanInputStats just has some numbers so the space consumption is small.


http://gerrit.cloudera.org:8080/#/c/24153/13/common/thrift/HBO.thrift@58
PS13, Line 58: }
> I just noticed some code in HistoricalStats.getNumOutputScan that specifica
That's fair. Refactored this to be a map.


http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/ExprCanonicalizer.java
File fe/src/main/java/org/apache/impala/planner/ExprCanonicalizer.java:

http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/ExprCanonicalizer.java@97
PS13, Line 97:       boolean shouldRemoveConstants) {
> I was thinking about this a little more and I'll give you one more argument
Ack. Added a TODO for this.


http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/ExprCanonicalizer.java@128
PS13, Line 128:     // NormalizeBinaryPredicatesRule guarantees that constant 
is on the right side.
> I hear ya.  Double negative is weird.
Ack


http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1944
PS13, Line 1944:       LOG.trace("HBO CONJUNCT STR ({}, {}): {}", statsType, 
strategy, s);
> Can we do the instantiation of TPlanNodeRun here and have this method retur
We need this append sematic since for plan nodes that have multiple children 
(like JoinNode), we call this for each operand to construct the input stats of 
the node.


http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/PlanNode.java
File fe/src/main/java/org/apache/impala/planner/PlanNode.java:

http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/PlanNode.java@1031
PS13, Line 1031:       // Break if the node doesn't support HBO.
> Hmmm...now that I think about it more...why not just put in the Preconditio
Right, in this patch, only HdfsScanNode uses it. I can use Preconditions check 
here. But in the next patch that supports AggregationNode, I need to change it 
back:
https://gerrit.cloudera.org/c/24297/3/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#839

So I'm just lazy to save one change. I can make the change if you want. :)


http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java
File fe/src/main/java/org/apache/impala/service/HistoricalStats.java:

http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@40
PS13, Line 40:
> optional nit: BackendConfig.INSTANCE is checked inline here, but the same c
Done


http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@60
PS13, Line 60:     if (!a.isSetScan_input_stats() || 
!b.isSetScan_input_stats()) {
> I may still be misunderstanding this...
Yeah, for EXPR_REWRITE which is the safest and most conservative strategy, we 
don't need that many runs. But more aggressive canonicalization strategy will 
map more runs into the same key. E.g., using IGNORE_PARTITION_CONSTANTS, the 
following scan queries are mapped to the same HBO hash key, assuming the table 
is partitioned by year, month, state:

 select * from tbl where year=2009 and month=1 and state='CA' and age>40;
 select * from tbl where year=2009 and month=2 and state='NY' and age>40;
 select * from tbl where year=2010 and month=1 and state='CA' and age>40;

The HBO hash key for the scan corresponds to this query:

 select * from tbl where year=<CONST> and month=<CONST> and state=<CONST> and 
age>40;

If numRows are similar between two partitions, we can reuse the HBO stats and 
just keep one. E.g., (year=2009, month=1, state='CA') and (year=2009, month=2, 
state='CA') are likely similar. But (year=2009, month=1, state='NY') could have 
numRows that differs a lot. We need to store runs for that and for other kinds 
of partitions as well.


http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@65
PS13, Line 65:     }
> Curious as to why you are passing all the individual members of the structu
Ah, we should simplify this. At the begining I plan to distinguish scan node 
runs and high level plan node runs since I thought scan node runs need 
specifial input stats like catalog version, input file size, etc. But later I 
realize all plan nodes need that.

Refactored this and renamed the method.


http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@96
PS13, Line 96:       for (TScanInputStats s : currRun.getScan_input_stats()) {
> nit: I think this would look better if you did this like the lines above.
Done


http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@126
PS13, Line 126:       long historical = 
run.getScan_input_stats().get(0).getInput_rows();
> Nit:  I think I like the name getSimilarRunWithNumRowsIndex better?  What d
Done. Renamed to getSimilarRunIndexWithNumRows since we return the index.


http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@190
PS13, Line 190:       if (hashKey == null) continue;
> We can do a stronger assert here that size == 1?
PlanNodes with multiple leaves scan nodes will have size > 1 here. We will see 
this in follow-up patches.

BTW, renamed this method to getPlanNodeOutputRows since it will be used in all 
plan node types.


http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStatsValue.java
File fe/src/main/java/org/apache/impala/service/HistoricalStatsValue.java:

http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStatsValue.java@39
PS13, Line 39:   public HistoricalStatsValue(T initialRun) {
> If we're doing a lot of removes, maybe a LinkedList would be better to prev
Yeah, nice catch!



--
To view, visit http://gerrit.cloudera.org:8080/24153
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6ff60a8bd22c13c0ecad1198934cc96249b1015e
Gerrit-Change-Number: 24153
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Steve Carlin <[email protected]>
Gerrit-Reviewer: Surya Hebbar <[email protected]>
Gerrit-Comment-Date: Tue, 02 Jun 2026 15:12:21 +0000
Gerrit-HasComments: Yes

Reply via email to