David Rorke has posted comments on this change. ( http://gerrit.cloudera.org:8080/21279 )
Change subject: IMPALA-12657: Improve ProcessingCost of ScanNode and NonGroupingAggregator ...................................................................... Patch Set 17: (7 comments) http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java File fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java: http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java@745 PS17, Line 745: int numAggExprs = getMaterializedAggregateExprs().size(); > Why switch to getMaterializedAggregateExprs? AFAICT getMaterializedAggregateExprs().size() should return the count of the non-grouping exprs/functions that will be evaluated, which matches what I need here for calculating an incremental cost per aggregate function per row. Let me know if I'm missing something or there's a better alternative. http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java@760 PS17, Line 760: LOG.trace("Total CPU cost estimate: " + totalCost > nit: This could use SLF4J parameterized messages: https://www.slf4j.org/faq Understood. Do we have any strong conventions or standards here. Just looking through the FE code I find a few different formatting approaches. If there's consensus around a different approach I could clean that up in a future patch. http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java: http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@638 PS17, Line 638: double lhsNetworkCost = (lhsHasCompatPartition) ? > nit: any good reason to reformat this file? Restored the original formatting. http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java: http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java@353 PS17, Line 353: // TODO: For broadcast join builds we're underestimating cost here because we're using > It would be nice to add tickets for TODOs. Is this one a significant effort I'll enter a ticket for that. It's not a big effort but also not trivial. I think it will be more important once we have parallel broadcast join builds because the build fragment count won't be fixed at that point. We'll definitely have to revisit this when that change comes in. http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@209 PS17, Line 209: // Coefficiencts for estimating scan CPU processing cost. Derived from benchmarking. > typo: should be Coefficients Done http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/PlanFragment.java File fe/src/main/java/org/apache/impala/planner/PlanFragment.java: http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/PlanFragment.java@648 PS17, Line 648: // TODO: Should we use AggregationNode.DEFAULT_SKEW_FACTOR when calculating > I guess do we care more about average or worst-case behavior here? Have any I think average case behavior is probably more appropriate for most cases of CPU costing, but given the way the NDV estimate below uses the cardinality it really isn't obvious what the right choice is in this case. We could do a little more modeling of this on paper to understand it better. But I'm comfortable that even without the skew adjustment the current code is an improvement over what we did before. There are other refinements (apart from the skew question) that I think we can make to this so I'll create an overall ticket for additional improvements to our per NDV estimates (or maybe just add some comments to the existing IMPALA-2945 ticket). http://gerrit.cloudera.org:8080/#/c/21279/17/fe/src/main/java/org/apache/impala/planner/PlanFragment.java@656 PS17, Line 656: // TODO: Can we do better than this? > This TODO doesn't feel very actionable. Agree. I've removed this TODO. -- To view, visit http://gerrit.cloudera.org:8080/21279 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icf1edd48d4ae255b7b3b7f5b228800d7bac7d2ca Gerrit-Change-Number: 21279 Gerrit-PatchSet: 17 Gerrit-Owner: David Rorke <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: David Rorke <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Kurt Deschler <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Comment-Date: Thu, 18 Apr 2024 01:15:15 +0000 Gerrit-HasComments: Yes
