alamb commented on PR #17805: URL: https://github.com/apache/datafusion/pull/17805#issuecomment-3418295957
> Thoughts @alamb ? Thank you for the crazy thorough research ❤️ > What I propose: > > * We should keep `WITHIN GROUP` optional, since we had users using `approx_percentile_cont` without it (so follow DuckDB here). For other ordered set aggregate functions we follow suit (already did so for `percentile_cont` here [feat: Add percentile_cont aggregate function #17988](https://github.com/apache/datafusion/pull/17988)) I agree > * If `WITHIN GROUP` not specified then it is implementation dependent what the default order is, though we should say it is ascending (we can't control this yet via code, only comments) I agree (and I think this is consistent with the current behavior too) > * Make `WITHIN GROUP` strictly apply only to aggregate functions which return `true` for `AggregateUDFImpl::is_ordered_set_aggregate()` -> [`WITHIN GROUP` needs to be more strict #18109](https://github.com/apache/datafusion/issues/18109) What do you mean "strictly apply" ? As in return an error for queries with `WITHIN GROUP` for an aggregate that `AggregateUDFImpl::is_ordered_set_aggregate() --> false`? > * `AggregateUDFImpl::is_ordered_set_aggregate()` itself doesn't apply any guarantees to input order, like postgres; we won't insert a sort or anything, it'll be up to the aggregate function itself to take the "hint" from `WITHIN GROUP` (not really a hint but you get the point) and do its ordering internally accordingly If this is consistent with what currently happens (I think it is) then it sounds good to me -- > * Refactor DataFrame APIs a bit to make it easier to use, for example: Rather than just `ascending` how about taking in `expression: SortExpr` , which would also allow specifying `NULLS FIRST` and/or `NULLS LAST` ? > // Proposed: much more explicit instead of needing to extract `expression` from `Sort` > pub fn approx_percentile_cont( > expression: Expr, > percentile: Expr, > ascending: bool, > centroids: Option<Expr>, > ) -> Expr { > todo!() > } I have one more question, which is "what should we do with just ORDER BY clause in the argument? For example ```sql SELECT approx_percentile_cont(expr ORDER BY time, 0.5) FROM ... ``` Would we treat that `ORDER BY` the same as if it were specified in the `WITHIN GROUP` clause? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
