Github user maryannxue commented on a diff in the pull request:
https://github.com/apache/spark/pull/19488#discussion_r144656360
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala ---
@@ -205,14 +205,17 @@ object PhysicalAggregation {
case logical.Aggregate(groupingExpressions, resultExpressions, child) =>
// A single aggregate expression might appear multiple times in resultExpressions.
// In order to avoid evaluating an individual aggregate function multiple times, we'll
- // build a set of the distinct aggregate expressions and build a function which can
+ // build a map of the distinct aggregate expressions and build a function which can
// be used to re-write expressions so that they reference the single copy of the
- // aggregate function which actually gets computed.
- val aggregateExpressions = resultExpressions.flatMap { expr =>
+ // aggregate function which actually gets computed. Note that aggregate expressions
+ // should always be deterministic, so we can use its canonicalized expression as its
--- End diff --
So we are talking about two types of "non-deterministic" here:
1. Non-deterministic across queries but deterministic within a query: the
same expression can produce different results over the same input between
different runs, but should always give the same result within a single run.
sum/avg on floating-point numbers could be an example. Should we make sure
that "select sum(f) - sum(f) from t" always returns 0? And similarly for
"first()": should "select first_value(c) = first_value(c) over ..." always
return true?
It is important to define the behavior first, because the two possible
answers lead to opposite approaches to handling the "deterministic" field
here.
2. Non-deterministic both across queries and within a query, which I don't
think is allowed anyway.
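
To make the first case concrete, here is a minimal standalone sketch (plain
Scala, not Spark code; the object name and the sample values are made up for
illustration). It assumes the only source of variation is the order in which
the inputs are added, which is what can change between runs when partitions
arrive in a different order. Floating-point addition is not associative, so
two independent evaluations of the same sum may disagree, while a single
evaluation that is referenced twice always cancels out.

```scala
object FloatSumOrder {
  // Sum strictly left to right, in the order the values arrive.
  def sum(xs: Seq[Double]): Double = xs.foldLeft(0.0)(_ + _)

  // Hypothetical column values; the same three numbers in two arrival orders.
  val values: Seq[Double] = Seq(1e16, 1.0, -1e16)
  val reordered: Seq[Double] = Seq(1e16, -1e16, 1.0)

  def main(args: Array[String]): Unit = {
    // Two independent evaluations over differently ordered input:
    // 1.0 is absorbed by 1e16 in the first order but survives in the second.
    val a = sum(values)
    val b = sum(reordered)
    println(s"evaluated twice: a = $a, b = $b, a - b = ${a - b}")

    // One evaluation, referenced twice: the difference is exactly zero.
    val once = sum(values)
    println(s"evaluated once: ${once - once}")
  }
}
```

If the planner keeps a single copy of sum(f) and rewrites both references to
it, "sum(f) - sum(f)" behaves like the second case and returns 0; evaluating
each occurrence separately leaves room for the first case.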