[GitHub] [spark] tanelk commented on pull request #29810: [SPARK-32940][SQL] Collect, first and last should be deterministic aggregate functions

GitBox Mon, 21 Sep 2020 05:58:46 -0700


tanelk commented on pull request #29810:
URL: https://github.com/apache/spark/pull/29810#issuecomment-696096481

> Maybe I am missing something here. AFAIK the problem with
First/Last/CollectList methods is that we can't control how results are merged.
This depends on how we shuffle fetches results and this is not deterministic.

You are 100% correct. As a user, this is how I would also understand the
term deterministic.
But, internally deterministic has different meaning - by this definition
`Sum` should be also non-deterministic if its input type is float or double.

I'll copy our internal definition:
```
* Note that this means that an expression should be considered as
non-deterministic if:
* - it relies on some mutable internal state, or
* - it relies on some implicit input that is not part of the children
expression list.
* - it has non-deterministic child or children.
* - it assumes the input satisfies some certain condition via the child
operator.
```
For aggregation expressions the internal state part can introduce extra
confusion - of course all of them have some internal state about the current
group they are aggregating (running count, largest value seen so far, etc),
but they do not "remember" the previous groups they have aggregated.

There is a separate optimizer rule `EliminateSorts`, that keeps track of
aggregators, that do not depend on input order - max, count, etc. But these are
a subset of all deterministic aggregators.

For context, why this is relevant:
A snippet from `PushPredicateThroughNonJoin`

https://github.com/apache/spark/blob/c336ddfdb81dd5c27fd109d62138dc129a02c30b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1138-L1141

Basically this case will filter out groups in the aggregation before
aggregating the values. Within one group the aggregator will still see all the
same rows in the same order, but it would not see the groups, that were
filtered out. This would change the output of an aggregator, that remembers
previous groups (non-deterministic), but it would not change the output of an
aggregator, that only cares about the current group (deterministic, but
possibly order relevant).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] tanelk commented on pull request #29810: [SPARK-32940][SQL] Collect, first and last should be deterministic aggregate functions

Reply via email to