Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/24153 )
Change subject: IMPALA-14597: Initial HBO support ...................................................................... Patch Set 15: (13 comments) > Patch Set 13: > > (10 comments) > > I started doing a second pass code review and I saw you responded to a first > pass. > > So this post is a bit of a mix between responding to what was there and some > new comments. > > Still need to look at this more today to understand the complete flow. Thanks for the review! I've done some refactors. Hope the code is cleaner. http://gerrit.cloudera.org:8080/#/c/24153/13/common/thrift/HBO.thrift File common/thrift/HBO.thrift: http://gerrit.cloudera.org:8080/#/c/24153/13/common/thrift/HBO.thrift@47 PS13, Line 47: 3: optional i64 num_input_files > Maybe not for this release, but something for the future: That's a good point. It seems thrift will always duplicate the records even if they are the same instance in Java. I'll double check this. Currently TScanInputStats just has some numbers so the space consumption is small. http://gerrit.cloudera.org:8080/#/c/24153/13/common/thrift/HBO.thrift@58 PS13, Line 58: } > I just noticed some code in HistoricalStats.getNumOutputScan that specifica That's fair. Refactored this to be a map. http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/ExprCanonicalizer.java File fe/src/main/java/org/apache/impala/planner/ExprCanonicalizer.java: http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/ExprCanonicalizer.java@97 PS13, Line 97: boolean shouldRemoveConstants) { > I was thinking about this a little more and I'll give you one more argument Ack. Added a TODO for this. http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/ExprCanonicalizer.java@128 PS13, Line 128: // NormalizeBinaryPredicatesRule guarantees that constant is on the right side. > I hear ya. Double negative is weird. Ack http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1944 PS13, Line 1944: LOG.trace("HBO CONJUNCT STR ({}, {}): {}", statsType, strategy, s); > Can we do the instantiation of TPlanNodeRun here and have this method retur We need this append sematic since for plan nodes that have multiple children (like JoinNode), we call this for each operand to construct the input stats of the node. http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/PlanNode.java File fe/src/main/java/org/apache/impala/planner/PlanNode.java: http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/planner/PlanNode.java@1031 PS13, Line 1031: // Break if the node doesn't support HBO. > Hmmm...now that I think about it more...why not just put in the Preconditio Right, in this patch, only HdfsScanNode uses it. I can use Preconditions check here. But in the next patch that supports AggregationNode, I need to change it back: https://gerrit.cloudera.org/c/24297/3/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#839 So I'm just lazy to save one change. I can make the change if you want. :) http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java File fe/src/main/java/org/apache/impala/service/HistoricalStats.java: http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@40 PS13, Line 40: > optional nit: BackendConfig.INSTANCE is checked inline here, but the same c Done http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@60 PS13, Line 60: if (!a.isSetScan_input_stats() || !b.isSetScan_input_stats()) { > I may still be misunderstanding this... Yeah, for EXPR_REWRITE which is the safest and most conservative strategy, we don't need that many runs. But more aggressive canonicalization strategy will map more runs into the same key. E.g., using IGNORE_PARTITION_CONSTANTS, the following scan queries are mapped to the same HBO hash key, assuming the table is partitioned by year, month, state: select * from tbl where year=2009 and month=1 and state='CA' and age>40; select * from tbl where year=2009 and month=2 and state='NY' and age>40; select * from tbl where year=2010 and month=1 and state='CA' and age>40; The HBO hash key for the scan corresponds to this query: select * from tbl where year=<CONST> and month=<CONST> and state=<CONST> and age>40; If numRows are similar between two partitions, we can reuse the HBO stats and just keep one. E.g., (year=2009, month=1, state='CA') and (year=2009, month=2, state='CA') are likely similar. But (year=2009, month=1, state='NY') could have numRows that differs a lot. We need to store runs for that and for other kinds of partitions as well. http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@65 PS13, Line 65: } > Curious as to why you are passing all the individual members of the structu Ah, we should simplify this. At the begining I plan to distinguish scan node runs and high level plan node runs since I thought scan node runs need specifial input stats like catalog version, input file size, etc. But later I realize all plan nodes need that. Refactored this and renamed the method. http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@96 PS13, Line 96: for (TScanInputStats s : currRun.getScan_input_stats()) { > nit: I think this would look better if you did this like the lines above. Done http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@126 PS13, Line 126: long historical = run.getScan_input_stats().get(0).getInput_rows(); > Nit: I think I like the name getSimilarRunWithNumRowsIndex better? What d Done. Renamed to getSimilarRunIndexWithNumRows since we return the index. http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@190 PS13, Line 190: if (hashKey == null) continue; > We can do a stronger assert here that size == 1? PlanNodes with multiple leaves scan nodes will have size > 1 here. We will see this in follow-up patches. BTW, renamed this method to getPlanNodeOutputRows since it will be used in all plan node types. http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStatsValue.java File fe/src/main/java/org/apache/impala/service/HistoricalStatsValue.java: http://gerrit.cloudera.org:8080/#/c/24153/13/fe/src/main/java/org/apache/impala/service/HistoricalStatsValue.java@39 PS13, Line 39: public HistoricalStatsValue(T initialRun) { > If we're doing a lot of removes, maybe a LinkedList would be better to prev Yeah, nice catch! -- To view, visit http://gerrit.cloudera.org:8080/24153 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6ff60a8bd22c13c0ecad1198934cc96249b1015e Gerrit-Change-Number: 24153 Gerrit-PatchSet: 15 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Steve Carlin <[email protected]> Gerrit-Reviewer: Surya Hebbar <[email protected]> Gerrit-Comment-Date: Tue, 02 Jun 2026 15:12:21 +0000 Gerrit-HasComments: Yes
