David Rorke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21279 )

Change subject: IMPALA-12657: Improve ProcessingCost of ScanNode and 
NonGroupingAggregator
......................................................................


Patch Set 13:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/BaseProcessingCost.java
File fe/src/main/java/org/apache/impala/planner/BaseProcessingCost.java:

http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/BaseProcessingCost.java@30
PS13, Line 30: public BaseProcessingCost(
             :       long cardinality, float exprsCost, float 
materializationCost) {
             :     // TODO: materializationCost accommodate ProcessingCost 
where row width should be
             :     // factor in. Currently, ProcessingCost of ScanNode, 
ExchangeNode, and DataStreamSink
             :     // has row width factored in through materialization 
parameter here. Investigate if
             :     // other operator need to have its row width factored in as 
well and whether we should
             :     // have specific 'rowWidth' parameter here.
> ... removed or simplified ...
There are still some scan node types (e.g. Kudu, HBase) that don't use the 
benchmark-based costing and depend on this form of the constructor (via 
ScanNode.computeProcessingCost).  I left the constructor in it's current form 
but clarified the comment.


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/BaseProcessingCost.java@77
PS13, Line 77: if (cardinality_ != 0L) { output.append(" 
cardinality=").append(cardinality_); }
             :     output.append(" total-cost=").append(totalCost_);
> nit: Print total-cost first before cardinality since the latter is now opti
Done


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/DataStreamSink.java
File fe/src/main/java/org/apache/impala/planner/DataStreamSink.java:

http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/DataStreamSink.java@117
PS13, Line 117: LOG.trace("Total CPU cost estimate: " + totalCost
              :           + ", Output Card: " + outputCardinality + ", Output 
Size: " + outputSize);
> nit: Please log which branch from above that totalCost is calculated from.
Done


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java
File fe/src/main/java/org/apache/impala/planner/ExchangeNode.java:

http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java@230
PS13, Line 230: public static double getAvgDeserializedRowSize
> Can this and getAvgSerializedRowSize() turned into non-static class method?
Changed both methods to non-static.

The use of the serialized size in DistributedPlanner seems to make sense since 
they're calculating some network costs there. In the case of the callers to 
getAvgDeserializedRowSize, they're all computing costs based on benchmark 
coefficients that were calculated using deserialized sizes so that seems like 
the safer choice (see the comment in ExchangeNode.computeProcessingCost).


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/ExchangeNode.java@277
PS13, Line 277: LOG.trace("Total CPU cost estimate: " + totalCost
              :           + ", Input Card: " + inputCardinality + ", Input 
Size: " + inputSize);
> nit: Please log which branch from above that totalCost is calculated from.
Done


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java:

http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java@342
PS13, Line 342: double totalProbeCost =
              :         (getProbeCardinalityForCosting() * 
COST_COEFFICIENT_PROBE_INPUT)
              :         + (getCardinality() * 
COST_COEFFICIENT_HASH_JOIN_OUTPUT);
> Is conjunct evaluation not counted anymore towards probing cost?
Not as a separate factor this point. Conjunct eval cost is included in the 
benchmarked times used to derive the coefficients but the benchmarking results 
so far didn't justify breaking the conjunct eval cost out as a separate factor 
(it didn't appear to be a significant distinguishable "feature" in the cost 
model). I think it could still be a factor in some cases but we can refine this 
in the future.


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2226
PS13, Line 2226: getAvgRowSize() - getRowPadSize()
> nit: create and use getAvgRowSizeWithoutPad().
Done


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java:

http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java@162
PS13, Line 162: inputNode.getAvgRowSize() - inputNode.getRowPadSize()
> nit: Create and use getAvgRowSizeWithoutPad() in PlanNode.java (see my comm
Done


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java@179
PS13, Line 179: LOG.trace("HdfsTableSink insert CPU cost estimate: " + 
totalCost + ", Cardinality: "
              :           + cardinality + ", Estimated Bytes Inserted: " + 
estBytesInserted);
> nit: Please log which branch from above that totalCost is calculated from.
Done


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/NestedLoopJoinNode.java
File fe/src/main/java/org/apache/impala/planner/NestedLoopJoinNode.java:

http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/NestedLoopJoinNode.java@117
PS13, Line 117: buildCardinality / fragment_.getNumInstancesForCosting()
> Does this require Math.ceil()?
Done


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/PlanNode.java
File fe/src/main/java/org/apache/impala/planner/PlanNode.java:

http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/PlanNode.java@251
PS13, Line 251: public float getAvgRowSize() { return avgRowSize_; }
              :   public float getRowPadSize() { return rowPadSize_; }
> Consider add the following and use it when relevant:
Done


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/ProcessingCost.java
File fe/src/main/java/org/apache/impala/planner/ProcessingCost.java:

http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/ProcessingCost.java@88
PS13, Line 88: public static ProcessingCost basicCost(
             :       String label, long cardinality, float exprsCost, float 
materializationCost) {
> Looks like only ScanNode.java call this method. Consider removing this and
There are still callers of ScanNode.computeScanProcessing cost that haven't 
been converted to the new benchmark-based costing and their costing still 
relies on computation using exprs and materailization cost. Later when those 
are all converted to the benchmark-based costing we can clean this up.


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/ProcessingCost.java@98
PS13, Line 98: computeValidBaseCost(cardinality, exprsCost, 0)
> Consider replacing this with computeValidBaseCost(double totalCost).
I think there are still operators that rely on this form of basic cost because 
they haven't been converted to the benchmark-based costing. Will clean this up 
later when all uses are gone.


http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/UnionNode.java
File fe/src/main/java/org/apache/impala/planner/UnionNode.java:

http://gerrit.cloudera.org:8080/#/c/21279/13/fe/src/main/java/org/apache/impala/planner/UnionNode.java@175
PS13, Line 175: LOG.trace("Union CPU cost estimate: " + totalCost
              :           + ", estBytesMaterialized: " + estBytesMaterialized
              :           + ", totalMaterializedCardinality: " + 
totalMaterializedCardinality);
> nit: log resultExprLists_.size().
Done



--
To view, visit http://gerrit.cloudera.org:8080/21279
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icf1edd48d4ae255b7b3b7f5b228800d7bac7d2ca
Gerrit-Change-Number: 21279
Gerrit-PatchSet: 13
Gerrit-Owner: David Rorke <dro...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: David Rorke <dro...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Apr 2024 18:51:58 +0000
Gerrit-HasComments: Yes

Reply via email to