Jefffrey commented on PR #17805:
URL: https://github.com/apache/datafusion/pull/17805#issuecomment-3413963574

   Got around to doing some research.
   
   **Expectation (aka Postgres)**
   
   `WITHIN GROUP` clause is required for the ordered set aggregate functions; 
you can't omit them so an order by must always be set. The functions themselves 
(e.g. percentile_cont) do the sort, see this description from the postgres git 
commit:
   
   > Unlike the case for normal aggregates, the sorting of input rows for an 
ordered-set aggregate is _not_ done behind the scenes, but is the 
responsibility of the aggregate's support functions. The typical implementation 
approach is to keep a reference to a `tuplesort` object in the aggregate's 
state value, feed the incoming rows into that object, and then complete the 
sorting and read out the data in the final function.  This design allows the 
final function to perform special operations such as injecting additional 
`hypothetical` rows into the data to be sorted.
   
   I'm not familiar with postgres internals but it reads like it expects the 
aggregate functions to keep the state themselves and do their own sort; `WITHIN 
GROUP` doesn't guarantee anything about input order to the function.
   
   They also make a point about "direct" arguments:
   
   > The aggregates we have been describing so far are "normal" aggregates.  
PostgreSQL also supports _ordered-set aggregates_, which differ from normal 
aggregates in two key ways.  First, in addition to ordinary aggregated 
arguments that are evaluated once per input row, an ordered-set aggregate can 
have "direct" arguments that are evaluated only once per aggregation operation. 
 Second, the syntax for the ordinary aggregated arguments specifies a sort 
ordering for them explicitly.  An ordered-set aggregate is usually used to 
implement a computation that depends on a specific row ordering, for instance 
rank or percentile, so that the sort ordering is a required aspect of any call.
   
   For example, the percentile value like `0.5` in percentile_cont would be a 
direct argument.
   
   **DuckDB**
   
   DuckDB names some functions differently (`quantile_cont` instead of 
`percentile_cont`) and it doesn't require the `WITHIN GROUP` syntax:
   
   ```sql
   D select quantile_cont(col0, 1) from values (1), (2) t;
   ┌────────────────────────┐
   │ quantile_cont(col0, 1) │
   │         double         │
   ├────────────────────────┤
   │          2.0           │
   └────────────────────────┘
   ```
   
   However you _can_ use the `WITHIN GROUP` syntax along with `percentile_cont` 
name:
   
   ```sql
   D select percentile_cont(1) within group (order by col0) from values (1), 
(2) t;
   ┌────────────────────────────────┐
   │ quantile_cont(1 ORDER BY col0) │
   │             double             │
   ├────────────────────────────────┤
   │              2.0               │
   └────────────────────────────────┘
   D select quantile_cont(1) within group (order by col0) from values (1), (2) 
t;
   Parser Error:
   Unknown ordered aggregate "quantile_cont".
   D select percentile_cont(col0, 1) from values (1), (2) t;
   Catalog Error:
   Scalar Function with name percentile_cont does not exist!
   Did you mean "pi"?
   
   LINE 1: select percentile_cont(col0, 1) from values (1), (2) t;
   ```
   - Note how they _don't_ allow `quantile_cont` for `WITHIN GROUP`, only 
`percentile_cont` which is also locked to `WITHIN GROUP` (it cannot be used 
like `quantile_cont`)
   
   However DuckDB allows specifying order by for `quantile_cont` without 
`WITHIN GROUP`:
   
   ```sql
   D select quantile_cont(1 order by col0) from values (1), (2) t;
   ┌────────────────────────────────┐
   │ quantile_cont(1 ORDER BY col0) │
   │             double             │
   ├────────────────────────────────┤
   │              2.0               │
   └────────────────────────────────┘
   ```
   
   But it doesn't allow using both:
   
   ```sql
   D select quantile_cont(1 order by col0) within group (order by col0) from 
values (1), (2) t;
   Parser Error:
   cannot use multiple ORDER BY clauses with WITHIN GROUP
   
   LINE 1: select quantile_cont(1 order by col0) within group (order by col0) 
from values (1), (2) t;
                                                 ^
   ```
   
   **Actual (whats currently implemented in DataFusion)**
   
   (I'm ignoring anything related to schema name or unparser, focus only on 
functionality of aggregate).
   
   It seems all `AggregateUDFImpl::is_ordered_set_aggregate()` does is during 
SQL parsing it does the necessary magic to put the `ORDER BY` column as the 
argument to the aggregate function:
   
   
https://github.com/apache/datafusion/blob/c84e3cf5a5a9f4f4b2a0f44a03a90ff0b9461df7/datafusion/sql/src/expr/function.rs#L450-L469
   
   - i.e. essentially rewrites `percentile_cont(0.5) within group (order by 
col0)` to `percentile_cont(num, 0.5)` and passes along the sort order too
   
   This is **all** we currently do from what I see; we don't disallow using 
`WITHIN GROUP` on other aggregate functions (see #18109). Though our DataFrame 
API functions for the ordered set aggregate functions do require an explicit 
sort order as I mentioned before:
   
   
https://github.com/apache/datafusion/blob/c84e3cf5a5a9f4f4b2a0f44a03a90ff0b9461df7/datafusion/functions-aggregate/src/approx_percentile_cont.rs#L55-L60
   
   **References**
   
   [Postgres git commit of ordered set 
aggregates](https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8d65da1f01c6a4c84fe9c59aeb6b7e3adf870145)
   - Also a [nice 
article](https://paquier.xyz/postgresql-2/postgres-9-4-feature-highlight-within-group/)
 explaining the commit a bit
   
   Postgres documentation:
   - https://www.postgresql.org/docs/9.4/sql-expressions.html#SYNTAX-AGGREGATES
   - 
https://www.postgresql.org/docs/9.4/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE
   
   DuckDB:
   - 
https://duckdb.org/docs/stable/sql/functions/aggregates#ordered-set-aggregate-functions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to