tools4origins commented on pull request #30107:
URL: https://github.com/apache/spark/pull/30107#issuecomment-713454654


   Thank you @zero323 and all for your feedback. I agree with you; I had not 
considered your solution. The impact on the DSL is indeed high, as it 
introduces a new API pattern (a function that applies only to aggregations).
   
   For completeness, in case someone looks at this issue in the future, here is 
how to handle filtered aggregations with your approach:
   
   - `count(1)`: `count(when(df("id") < 50, 1))`
   - `count(*)`: `count(when(df("id") < 50, 1))`  (as `when` does not support 
`*`)
   - `count(id)`: `count(when(df("id") < 50, df("id")))`
   - Other aggregations, e.g. `avg(id)`: `avg(when(df("id") < 50, df("id")))`
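   Put together, a minimal sketch of the pattern (assuming a DataFrame `df` with an `id` column; untested, for illustration only):

   ```scala
   import org.apache.spark.sql.functions.{avg, count, when}

   // Emulate SQL's FILTER (WHERE id < 50) with when(): rows failing the
   // predicate map to null, and Spark aggregate functions skip nulls.
   df.agg(
     count(when(df("id") < 50, 1)),         // count(*)  FILTER (WHERE id < 50)
     count(when(df("id") < 50, df("id"))),  // count(id) FILTER (WHERE id < 50)
     avg(when(df("id") < 50, df("id")))     // avg(id)   FILTER (WHERE id < 50)
   )
   ```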
   
   I was wondering whether the same approach would work with distinct 
aggregations. I think in that case one needs to go through `expr()`, but `expr` 
does indeed do the job: `expr("stddev(distinct colName)")`.
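   As a sketch (assuming a column `colName`; untested):

   ```scala
   import org.apache.spark.sql.functions.expr

   // Distinct aggregation via SQL expression syntax, since the Column-based
   // DSL has no distinct variant for stddev.
   df.agg(expr("stddev(distinct colName)"))
   ```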


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to