Re: [PR] docs: refine `AggregateUDFImpl::is_ordered_set_aggregate` documentation [datafusion]

via GitHub Sat, 18 Oct 2025 04:35:04 -0700


alamb commented on PR #17805:
URL: https://github.com/apache/datafusion/pull/17805#issuecomment-3418295957


   > Thoughts @alamb ?
   
   Thank you for the crazy thorough research ❤️ 
   
   > What I propose:
   > 
   > * We should keep `WITHIN GROUP` optional, since we had users using 
`approx_percentile_cont` without it (so follow DuckDB here). For other ordered 
set aggregate functions we follow suit (already did so for `percentile_cont` 
here [feat: Add percentile_cont aggregate function 
#17988](https://github.com/apache/datafusion/pull/17988))
   
   I agree
   
   >   * If `WITHIN GROUP` not specified then it is implementation dependent 
what the default order is, though we should say it is ascending (we can't 
control this yet via code, only comments)
   
   I agree (and I think this is consistent with the current behavior too)
   
   > * Make `WITHIN GROUP` strictly apply only to aggregate functions which 
return `true` for `AggregateUDFImpl::is_ordered_set_aggregate()` -> [`WITHIN 
GROUP` needs to be more strict 
#18109](https://github.com/apache/datafusion/issues/18109)
   
   What do you mean "strictly apply" ? As in return an error for queries with 
`WITHIN GROUP` for an aggregate that 
`AggregateUDFImpl::is_ordered_set_aggregate() --> false`?
   
   > * `AggregateUDFImpl::is_ordered_set_aggregate()` itself doesn't apply any 
guarantees to input order, like postgres; we won't insert a sort or anything, 
it'll be up to the aggregate function itself to take the "hint" from `WITHIN 
GROUP` (not really a hint but you get the point) and do its ordering internally 
accordingly
   
   If this is consistent with what currently happens (I think it is) then it 
sounds good to me -- 
   
   > * Refactor DataFrame APIs a bit to make it easier to use, for example:
   
   Rather than just `ascending` how about taking in `expression: SortExpr` , 
which would also allow specifying `NULLS FIRST` and/or `NULLS LAST` ?
   
   > // Proposed: much more explicit instead of needing to extract `expression` 
from `Sort`
   > pub fn approx_percentile_cont(
   >     expression: Expr,
   >     percentile: Expr,
   >     ascending: bool,
   >     centroids: Option<Expr>,
   > ) -> Expr {
   >     todo!()
   > }
   
   I have one more question, which is "what should we do with just ORDER BY 
clause in the argument? For example
   ```sql
   SELECT approx_percentile_cont(expr ORDER BY time, 0.5) FROM ...
   ```
   
   Would we treat that `ORDER BY` the same as if it were specified in the 
`WITHIN GROUP` clause?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] docs: refine `AggregateUDFImpl::is_ordered_set_aggregate` documentation [datafusion]

Reply via email to