[GitHub] [spark] c21 commented on a change in pull request #35273: [SPARK-37983][SQL] Back out agg build time metrics from sort aggregate

GitBox Tue, 25 Jan 2022 08:44:26 -0800


c21 commented on a change in pull request #35273:
URL: https://github.com/apache/spark/pull/35273#discussion_r791326607




##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortAggregateExec.scala
##########
@@ -44,8 +42,7 @@ case class SortAggregateExec(
   with AliasAwareOutputOrdering {
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output 
rows"),
-    "aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "time in 
aggregation build"))

Review comment:
       @cloud-fan - for hash agg and object hash agg, yes, the `aggTime` is the 
time to build the hash table. For sort aggregate, there's no proper definition 
of agg time, as the whole time during sort aggregate is doing aggregate. 
There's not that useful to measure the execution time for sort aggregate.

##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortAggregateExec.scala
##########
@@ -44,8 +42,7 @@ case class SortAggregateExec(
   with AliasAwareOutputOrdering {
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output 
rows"),
-    "aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "time in 
aggregation build"))

Review comment:
       @cloud-fan - for hash agg and object hash agg, yes, the `aggTime` is the 
time to build the hash table. For sort aggregate, there's no proper definition 
of agg time, as the whole time during sort aggregate is doing aggregate. It's 
not that useful to measure the execution time for sort aggregate.

##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortAggregateExec.scala
##########
@@ -44,8 +42,7 @@ case class SortAggregateExec(
   with AliasAwareOutputOrdering {
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output 
rows"),
-    "aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "time in 
aggregation build"))

Review comment:
       > if there is no group col, does HashAggregate always report aggTime as 
0?
   
   No, `aggTime` is the time for `HashAggregate` to aggregate all input rows.

##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortAggregateExec.scala
##########
@@ -44,8 +42,7 @@ case class SortAggregateExec(
   with AliasAwareOutputOrdering {
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output 
rows"),
-    "aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "time in 
aggregation build"))

Review comment:
       @cloud-fan - yeah, I agree with you, that `aggTime` is misleading. It's 
the time to build hash table when group columns exist.

##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala
##########
@@ -47,6 +47,12 @@ trait AggregateCodegenSupport
    */
   private var bufVars: Seq[Seq[ExprCode]] = _
 
+  /**
+   * Whether this operator has an aggregate build phase.
+   * Only [[HashAggregateExec]] has build phase now.
+   */
+  protected def hasAggBuild: Boolean

Review comment:
       do we want to report `aggTime` as 0 now for `HashAggregate` without 
group by columns?
   
   I am fine with changing to report 0 now, as the SQL metrics shown up as 
`time in aggregation build` in Spark UI for end users. There's no build for 
`HashAggregate` without group by columns, so I think it's a valid fix. Just 
want to mention here as this is a user-facing change.

##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala
##########
@@ -47,6 +47,12 @@ trait AggregateCodegenSupport
    */
   private var bufVars: Seq[Seq[ExprCode]] = _
 
+  /**
+   * Whether this operator has an aggregate build phase.
+   * Only [[HashAggregateExec]] has build phase now.
+   */
+  protected def hasAggBuild: Boolean

Review comment:
       @cloud-fan - sure, changed to `needHashTable`. Leave 
`HashAggregateExec.needHashTable` to be true, to maintain existing behavior for 
now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] c21 commented on a change in pull request #35273: [SPARK-37983][SQL] Back out agg build time metrics from sort aggregate

Reply via email to