Github user maryannxue commented on a diff in the pull request:
https://github.com/apache/spark/pull/19488#discussion_r144543377
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
---
@@ -205,14 +205,15 @@ object PhysicalAggregation {
    case logical.Aggregate(groupingExpressions, resultExpressions, child) =>
      // A single aggregate expression might appear multiple times in resultExpressions.
      // In order to avoid evaluating an individual aggregate function multiple times, we'll
-     // build a set of the distinct aggregate expressions and build a function which can
+     // build a map of the distinct aggregate expressions and build a function which can
      // be used to re-write expressions so that they reference the single copy of the
      // aggregate function which actually gets computed.
-     val aggregateExpressions = resultExpressions.flatMap { expr =>
+     val aggregateExpressionMap = resultExpressions.flatMap { expr =>
        expr.collect {
-         case agg: AggregateExpression => agg
+         case agg: AggregateExpression => (agg.canonicalized, agg.deterministic) -> agg
--- End diff ---
Thank you, @cloud-fan! You've made a very good point: we should handle
non-determinism correctly here. I have confirmed, though, that all aggregate
functions are deterministic (and fortunately Spark already guarantees this in
the analysis stage). So I think it is safe to key on "agg.canonicalized" alone,
without also including the "deterministic" field, which would only be
confusing. I have also added a code comment explaining this point in my latest
commit.
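To illustrate the technique the diff is discussing — collapsing semantically
equal aggregate expressions under a canonical key so each distinct aggregate is
computed only once — here is a minimal, self-contained Scala sketch. The `Expr`
class and its toy `canonicalized` method are hypothetical stand-ins, not
Catalyst's actual `Expression`/`AggregateExpression` classes; only the
dedup-by-canonical-key pattern mirrors the patch.

```scala
// Hypothetical simplified expression tree; "canonicalized" here just
// normalizes cosmetic differences (case) so semantically equal expressions
// produce the same key. Catalyst's real canonicalization is more involved.
case class Expr(name: String, children: Seq[Expr] = Nil) {
  def canonicalized: Expr =
    Expr(name.toLowerCase, children.map(_.canonicalized))
}

object DedupDemo {
  // Build a map keyed by the canonical form: duplicates collapse to a single
  // entry, so each distinct aggregate would be evaluated only once.
  def dedup(exprs: Seq[Expr]): Map[Expr, Expr] =
    exprs.map(e => e.canonicalized -> e).toMap

  def main(args: Array[String]): Unit = {
    val es = Seq(
      Expr("SUM", Seq(Expr("a"))),
      Expr("sum", Seq(Expr("A"))), // same aggregate, different spelling
      Expr("max", Seq(Expr("b")))
    )
    // The two sums share one canonical key, so only two entries remain.
    println(dedup(es).size)
  }
}
```

Rewritten result expressions would then reference the single surviving copy via
a lookup on their own canonical form, which is exactly what the re-write
function mentioned in the comment does.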
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]