c21 commented on a change in pull request #35273:
URL: https://github.com/apache/spark/pull/35273#discussion_r791326607
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortAggregateExec.scala
##########
@@ -44,8 +42,7 @@ case class SortAggregateExec(
with AliasAwareOutputOrdering {
override lazy val metrics = Map(
- "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output
rows"),
- "aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "time in
aggregation build"))
Review comment:
@cloud-fan - for hash agg and object hash agg, yes, the `aggTime` is the
time to build the hash table. For sort aggregate, there's no proper definition
of agg time, as the whole time during sort aggregate is doing aggregate.
There's not that useful to measure the execution time for sort aggregate.
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortAggregateExec.scala
##########
@@ -44,8 +42,7 @@ case class SortAggregateExec(
with AliasAwareOutputOrdering {
override lazy val metrics = Map(
- "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output
rows"),
- "aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "time in
aggregation build"))
Review comment:
@cloud-fan - for hash agg and object hash agg, yes, the `aggTime` is the
time to build the hash table. For sort aggregate, there's no proper definition
of agg time, as the whole time during sort aggregate is doing aggregate. It's
not that useful to measure the execution time for sort aggregate.
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortAggregateExec.scala
##########
@@ -44,8 +42,7 @@ case class SortAggregateExec(
with AliasAwareOutputOrdering {
override lazy val metrics = Map(
- "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output
rows"),
- "aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "time in
aggregation build"))
Review comment:
> if there is no group col, does HashAggregate always report aggTime as
0?
No, `aggTime` is the time for `HashAggregate` to aggregate all input rows.
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortAggregateExec.scala
##########
@@ -44,8 +42,7 @@ case class SortAggregateExec(
with AliasAwareOutputOrdering {
override lazy val metrics = Map(
- "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output
rows"),
- "aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "time in
aggregation build"))
Review comment:
@cloud-fan - yeah, I agree with you, that `aggTime` is misleading. It's
the time to build hash table when group columns exist.
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala
##########
@@ -47,6 +47,12 @@ trait AggregateCodegenSupport
*/
private var bufVars: Seq[Seq[ExprCode]] = _
+ /**
+ * Whether this operator has an aggregate build phase.
+ * Only [[HashAggregateExec]] has build phase now.
+ */
+ protected def hasAggBuild: Boolean
Review comment:
do we want to report `aggTime` as 0 now for `HashAggregate` without
group by columns?
I am fine with changing to report 0 now, as the SQL metrics shown up as
`time in aggregation build` in Spark UI for end users. There's no build for
`HashAggregate` without group by columns, so I think it's a valid fix. Just
want to mention here as this is a user-facing change.
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregateCodegenSupport.scala
##########
@@ -47,6 +47,12 @@ trait AggregateCodegenSupport
*/
private var bufVars: Seq[Seq[ExprCode]] = _
+ /**
+ * Whether this operator has an aggregate build phase.
+ * Only [[HashAggregateExec]] has build phase now.
+ */
+ protected def hasAggBuild: Boolean
Review comment:
@cloud-fan - sure, changed to `needHashTable`. Leave
`HashAggregateExec.needHashTable` to be true, to maintain existing behavior for
now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]