Github user stanzhai commented on a diff in the pull request:
https://github.com/apache/spark/pull/19301#discussion_r140699522
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala
---
@@ -72,11 +74,19 @@ object AggregateExpression {
aggregateFunction: AggregateFunction,
mode: AggregateMode,
isDistinct: Boolean): AggregateExpression = {
+ val state = if (aggregateFunction.resolved) {
+ Seq(aggregateFunction.toString, aggregateFunction.dataType,
+ aggregateFunction.nullable, mode, isDistinct)
+ } else {
+ Seq(aggregateFunction.toString, mode, isDistinct)
+ }
+ val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 *
a + b)
+
AggregateExpression(
aggregateFunction,
mode,
isDistinct,
- NamedExpression.newExprId)
+ ExprId(hashCode))
--- End diff --
I've tried to optimize in aggregate planner
(https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L211).
```scala
// A single aggregate expression might appear multiple times in
resultExpressions.
// In order to avoid evaluating an individual aggregate function
multiple times, we'll
// build a set of the distinct aggregate expressions and build a
function which can
// be used to re-write expressions so that they reference the single
copy of the
// aggregate function which actually gets computed.
val aggregateExpressions = resultExpressions.flatMap { expr =>
expr.collect {
case agg: AggregateExpression =>
val aggregateFunction = agg.aggregateFunction
val state = if (aggregateFunction.resolved) {
Seq(aggregateFunction.toString, aggregateFunction.dataType,
aggregateFunction.nullable, agg.mode, agg.isDistinct)
} else {
Seq(aggregateFunction.toString, agg.mode, agg.isDistinct)
}
val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b)
=> 31 * a + b)
(hashCode, agg)
}
}.groupBy(_._1).map { case (_, values) =>
values.head._2
}.toSeq
```
But it's difficult to distinguish between different typed aggregators
without expr id. Current solution can work well for all of aggregate functions.
I'm not familiar with typed aggregators, any suggestions will be
appreciated.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]