tools4origins commented on pull request #30107:
URL: https://github.com/apache/spark/pull/30107#issuecomment-713454654
Thank you @zero323 and all for your feedback. I agree with you too; I did
not have your solution in mind. The impact on the DSL is indeed high, as it
introduces a new API pattern (a function that applies only to aggregations).
For completeness, in case someone looks at this issue in the future, I am
referencing here how to handle filtered aggregations with your approach:
- `count(1)`: `count(when(df("id") < 50, 1))`
- `count(*)`: `count(when(df("id") < 50, 1))` (as `when` does not support
`*`)
- `count(id)`: `count(when(df("id") < 50, df("id")))`
- Other aggregations, e.g. `avg(id)`: `avg(when(df("id") < 50, df("id")))`
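To make the equivalence behind these rewrites concrete, here is a pure-Python sketch (not Spark code; the `rows` data, the `when` helper, and the threshold 50 are illustrative stand-ins). It models the key Spark semantics that make the rewrite work: aggregate functions skip nulls, and `when` without an `otherwise` yields null for non-matching rows.

```python
# Toy model: Spark aggregates skip nulls, so mapping non-matching rows
# to None via when(...) behaves like filtering them out of the aggregate.
rows = list(range(100))  # stand-in for the values of df("id"): 0..99

def when(cond, value):
    # Minimal model of F.when(cond, value) with no otherwise(): null (None)
    # when the condition does not hold.
    return value if cond else None

# count(when(id < 50, 1)): count() ignores the None rows
count_when = sum(1 for r in rows if when(r < 50, 1) is not None)

# The filtered aggregation it emulates: count(1) over rows where id < 50
count_filtered = sum(1 for r in rows if r < 50)

# avg(when(id < 50, id)): avg() also averages only the non-None values
kept = [when(r < 50, r) for r in rows]
avg_when = sum(v for v in kept if v is not None) / sum(
    1 for v in kept if v is not None
)
avg_filtered = sum(r for r in rows if r < 50) / 50

assert count_when == count_filtered == 50
assert avg_when == avg_filtered == 24.5
```

The same null-skipping property is what makes the `count(*)` case above safe to express as `count(when(..., 1))`.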
I was wondering whether the same approach would work with distinct aggregations.
I think in that case `expr()` is needed, but it does the job:
`expr("stddev(distinct colName)")`.
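For the distinct case, the intended semantics can again be sketched in plain Python (an illustration only, not Spark code; the sample values and the `< 50` condition are hypothetical). Combining the `when`-based filter with `distinct` amounts to: keep the matching rows, deduplicate, then aggregate. Spark's `stddev` is the sample standard deviation, which `statistics.stdev` also computes.

```python
import statistics

# Toy column values for colName; 60s would be excluded by a filter id < 50.
rows = [1, 1, 2, 2, 3, 60, 60]

# Model of stddev(distinct colName) restricted to rows matching the
# condition: filter, deduplicate, then take the sample standard deviation.
filtered_distinct = sorted({v for v in rows if v < 50})  # [1, 2, 3]
result = statistics.stdev(filtered_distinct)

assert abs(result - 1.0) < 1e-9  # stdev of [1, 2, 3] is 1.0
```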