David Rorke has posted comments on this change. ( http://gerrit.cloudera.org:8080/21279 )
Change subject: IMPALA-12657: Improve ProcessingCost of ScanNode and NonGroupingAggregator ...................................................................... Patch Set 13: (14 comments) http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/BaseProcessingCost.java File fe/src/main/java/org/apache/impala/planner/BaseProcessingCost.java: http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/BaseProcessingCost.java@30 PS13, Line 30: public BaseProcessingCost( : long cardinality, float exprsCost, float materializationCost) { : // TODO: materializationCost accommodate ProcessingCost where row width should be : // factor in. Currently, ProcessingCost of ScanNode, ExchangeNode, and DataStreamSink : // has row width factored in through materialization parameter here. Investigate if : // other operator need to have its row width factored in as well and whether we should : // have specific 'rowWidth' parameter here. > ... removed or simplified ... There are still some scan node types (e.g. Kudu, HBase) that don't use the benchmark-based costing and depend on this form of the constructor (via ScanNode.computeProcessingCost). I left the constructor in it's current form but clarified the comment. http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/BaseProcessingCost.java@77 PS13, Line 77: if (cardinality_ != 0L) { output.append(" cardinality=").append(cardinality_); } : output.append(" total-cost=").append(totalCost_); > nit: Print total-cost first before cardinality since the latter is now opti Done http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/DataStreamSink.java File fe/src/main/java/org/apache/impala/planner/DataStreamSink.java: http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/DataStreamSink.java@117 PS13, Line 117: LOG.trace("Total CPU cost estimate: " + totalCost : + ", Output Card: " + outputCardinality + ", Output Size: " + outputSize); > nit: Please log which branch from above that totalCost is calculated from. Done http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java File fe/src/main/java/org/apache/impala/planner/ExchangeNode.java: http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java@230 PS13, Line 230: public static double getAvgDeserializedRowSize > Can this and getAvgSerializedRowSize() turned into non-static class method? Changed both methods to non-static. The use of the serialized size in DistributedPlanner seems to make sense since they're calculating some network costs there. In the case of the callers to getAvgDeserializedRowSize, they're all computing costs based on benchmark coefficients that were calculated using deserialized sizes so that seems like the safer choice (see the comment in ExchangeNode.computeProcessingCost). http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java@277 PS13, Line 277: LOG.trace("Total CPU cost estimate: " + totalCost : + ", Input Card: " + inputCardinality + ", Input Size: " + inputSize); > nit: Please log which branch from above that totalCost is calculated from. Done http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java: http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java@342 PS13, Line 342: double totalProbeCost = : (getProbeCardinalityForCosting() * COST_COEFFICIENT_PROBE_INPUT) : + (getCardinality() * COST_COEFFICIENT_HASH_JOIN_OUTPUT); > Is conjunct evaluation not counted anymore towards probing cost? Not as a separate factor this point. Conjunct eval cost is included in the benchmarked times used to derive the coefficients but the benchmarking results so far didn't justify breaking the conjunct eval cost out as a separate factor (it didn't appear to be a significant distinguishable "feature" in the cost model). I think it could still be a factor in some cases but we can refine this in the future. http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2226 PS13, Line 2226: getAvgRowSize() - getRowPadSize() > nit: create and use getAvgRowSizeWithoutPad(). Done http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java: http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java@162 PS13, Line 162: inputNode.getAvgRowSize() - inputNode.getRowPadSize() > nit: Create and use getAvgRowSizeWithoutPad() in PlanNode.java (see my comm Done http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java@179 PS13, Line 179: LOG.trace("HdfsTableSink insert CPU cost estimate: " + totalCost + ", Cardinality: " : + cardinality + ", Estimated Bytes Inserted: " + estBytesInserted); > nit: Please log which branch from above that totalCost is calculated from. Done http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/NestedLoopJoinNode.java File fe/src/main/java/org/apache/impala/planner/NestedLoopJoinNode.java: http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/NestedLoopJoinNode.java@117 PS13, Line 117: buildCardinality / fragment_.getNumInstancesForCosting() > Does this require Math.ceil()? Done http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/PlanNode.java File fe/src/main/java/org/apache/impala/planner/PlanNode.java: http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/PlanNode.java@251 PS13, Line 251: public float getAvgRowSize() { return avgRowSize_; } : public float getRowPadSize() { return rowPadSize_; } > Consider add the following and use it when relevant: Done http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/ProcessingCost.java File fe/src/main/java/org/apache/impala/planner/ProcessingCost.java: http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/ProcessingCost.java@88 PS13, Line 88: public static ProcessingCost basicCost( : String label, long cardinality, float exprsCost, float materializationCost) { > Looks like only ScanNode.java call this method. Consider removing this and There are still callers of ScanNode.computeScanProcessing cost that haven't been converted to the new benchmark-based costing and their costing still relies on computation using exprs and materailization cost. Later when those are all converted to the benchmark-based costing we can clean this up. http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/ProcessingCost.java@98 PS13, Line 98: computeValidBaseCost(cardinality, exprsCost, 0) > Consider replacing this with computeValidBaseCost(double totalCost). I think there are still operators that rely on this form of basic cost because they haven't been converted to the benchmark-based costing. Will clean this up later when all uses are gone. http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/UnionNode.java File fe/src/main/java/org/apache/impala/planner/UnionNode.java: http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/UnionNode.java@175 PS13, Line 175: LOG.trace("Union CPU cost estimate: " + totalCost : + ", estBytesMaterialized: " + estBytesMaterialized : + ", totalMaterializedCardinality: " + totalMaterializedCardinality); > nit: log resultExprLists_.size(). Done -- To view, visit http://gerrit.cloudera.org:8080/21279 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icf1edd48d4ae255b7b3b7f5b228800d7bac7d2ca Gerrit-Change-Number: 21279 Gerrit-PatchSet: 13 Gerrit-Owner: David Rorke <dro...@cloudera.com> Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com> Gerrit-Reviewer: David Rorke <dro...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com> Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com> Gerrit-Comment-Date: Wed, 17 Apr 2024 18:51:58 +0000 Gerrit-HasComments: Yes