[Impala-ASF-CR] IMPALA-12657: Improve ProcessingCost of ScanNode and NonGroupingAggregator

David Rorke (Code Review) Wed, 17 Apr 2024 18:20:37 -0700

David Rorke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21279 )


Change subject: IMPALA-12657: Improve ProcessingCost of ScanNode and 
NonGroupingAggregator
......................................................................


Patch Set 17:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java
File fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java:

http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java@745
PS17, Line 745:     int numAggExprs = getMaterializedAggregateExprs().size();
> Why switch to getMaterializedAggregateExprs?
AFAICT getMaterializedAggregateExprs().size() should return the count of the 
non-grouping exprs/functions that will be evaluated, which matches what I need 
here for calculating an incremental cost per aggregate function per row. Let me 
know if I'm missing something or there's a better alternative.


http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java@760
PS17, Line 760:       LOG.trace("Total CPU cost estimate: " + totalCost
> nit: This could use SLF4J parameterized messages: https://www.slf4j.org/faq
Understood. Do we have any strong conventions or standards here. Just looking 
through the FE code I find a few different formatting approaches. If there's 
consensus around a different approach I could clean that up in a future patch.


http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java:

http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@638
PS17, Line 638:       double lhsNetworkCost = (lhsHasCompatPartition) ?
> nit: any good reason to reformat this file?
Restored the original formatting.


http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java:

http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java@353
PS17, Line 353:     // TODO: For broadcast join builds we're underestimating 
cost here because we're using
> It would be nice to add tickets for TODOs. Is this one a significant effort
I'll enter a ticket for that. It's not a big effort but also not trivial. I 
think it will be more important once we have parallel broadcast join builds 
because the build fragment count won't be fixed at that point. We'll definitely 
have to revisit this when that change comes in.


http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@209
PS17, Line 209:   // Coefficiencts for estimating scan CPU processing cost. 
Derived from benchmarking.
> typo: should be Coefficients
Done


http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/PlanFragment.java
File fe/src/main/java/org/apache/impala/planner/PlanFragment.java:

http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/PlanFragment.java@648
PS17, Line 648:     // TODO: Should we use AggregationNode.DEFAULT_SKEW_FACTOR 
when calculating
> I guess do we care more about average or worst-case behavior here? Have any
I think average case behavior is probably more appropriate for most cases of 
CPU costing, but given the way the NDV estimate below uses the cardinality it 
really isn't obvious what the right choice is in this case. We could do a 
little more modeling of this on paper to understand it better. But I'm 
comfortable that even without the skew adjustment the current code is an 
improvement over what we did before.

There are other refinements (apart from the skew question) that I think we can 
make to this so I'll create an overall ticket for additional improvements to 
our per NDV estimates (or maybe just add some comments to the existing 
IMPALA-2945 ticket).


http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/PlanFragment.java@656
PS17, Line 656:         // TODO: Can we do better than this?
> This TODO doesn't feel very actionable.
Agree. I've removed this TODO.



--
To view, visit http://gerrit.cloudera.org:8080/21279
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icf1edd48d4ae255b7b3b7f5b228800d7bac7d2ca
Gerrit-Change-Number: 21279
Gerrit-PatchSet: 17
Gerrit-Owner: David Rorke <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: David Rorke <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Comment-Date: Thu, 18 Apr 2024 01:15:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-12657: Improve ProcessingCost of ScanNode and NonGroupingAggregator

Reply via email to