tanelk commented on pull request #29810:
URL: https://github.com/apache/spark/pull/29810#issuecomment-696096481


   > Maybe I am missing something here. AFAIK the problem with 
First/Last/CollectList methods is that we can't control how results are merged. 
This depends on how we shuffle fetches results and this is not deterministic.
   
   You are 100% correct. As a user, this is how I would also understand the 
term deterministic.
   But, internally deterministic has different meaning - by this definition 
`Sum` should be also non-deterministic if its input type is float or double. 
   
   I'll copy our internal definition:
   ```
      * Note that this means that an expression should be considered as 
non-deterministic if:
      * - it relies on some mutable internal state, or
      * - it relies on some implicit input that is not part of the children 
expression list.
      * - it has non-deterministic child or children.
      * - it assumes the input satisfies some certain condition via the child 
operator.
   ```
   For aggregation expressions the internal state part can introduce extra 
confusion - of course all of them have some internal state about the current 
group they are aggregating (running count, largest value seen  so far, etc), 
but they do not "remember" the previous groups they have aggregated. 
   
   There is a separate optimizer rule `EliminateSorts`, that keeps track of 
aggregators, that do not depend on input order - max, count, etc. But these are 
a subset of all deterministic aggregators.
   
   For context, why this is relevant:
   A snippet from `PushPredicateThroughNonJoin`
   
https://github.com/apache/spark/blob/c336ddfdb81dd5c27fd109d62138dc129a02c30b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1138-L1141
   
   Basically this case will filter out groups in the aggregation before 
aggregating the values. Within one group the aggregator will still see all the 
same rows in the same order, but it would not see the groups, that were 
filtered out. This would change the output of an aggregator, that remembers 
previous groups (non-deterministic), but it would not change the output of an 
aggregator, that only cares about the current group (deterministic, but 
possibly order relevant).
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to