[I] Extend windowed aggregate allow-list (collect_set, collect_list, and other natively-supported aggregates) [datafusion-comet]

via GitHub Tue, 30 Jun 2026 06:45:02 -0700


andygrove opened a new issue, #4766:
URL: https://github.com/apache/datafusion-comet/issues/4766


   ### What is the problem the feature request solves?
   
   The `CometWindowExec` translator carries a hard-coded allow-list of 
aggregate functions that may run inside a window:
   
   
https://github.com/apache/datafusion-comet/blob/main/spark/src/main/scala/org/apache/spark/sql/comet/CometWindowExec.scala#L100-L141
   
   Today it accepts `Count`, `Min`, `Max`, `Sum`, `Average`, `First`, `Last`. 
Anything else hits the catch-all and tags a fallback:
   
   > aggregate \`collect_set(...)\` is not supported for window function
   
   Several aggregates that Comet *does* support in batch mode are excluded by 
this list, so queries that wrap them in `OVER (...)` fall back to Spark even 
when DataFusion's window machinery could execute them natively. Examples I've 
hit or expect to hit:
   
   - `collect_set` / `collect_list` — fully supported as batch aggregates 
(#3951, #2525)
   - `StddevSamp`, `StddevPop`, `VarSamp`, `VarPop`
   - `CorrPearson`, `Covariance`
   - `BitAnd`, `BitOr`, `BitXor`
   - `BloomFilterAggregate`
   
   ### Describe the potential solution
   
   Two paths, in increasing scope:
   
   1. **Targeted whitelist additions** — add cases to the match in 
`CometWindowExec.scala:100` for each aggregate where DataFusion's window 
operator can already evaluate it, gated on the relevant data-type checks (the 
existing `Sum`/`Average`/`Min`/`Max` arms already follow this pattern). 
`collect_set` and `collect_list` are the cheapest to land first because they 
have no sliding-frame subtleties when the frame is unbounded.
   
   2. **Generalize dispatch** — replace the allow-list with a single check that 
delegates to the same `AggSerde` path used by the batch aggregate operator, and 
let DataFusion reject anything it can't run. This stops the gap from re-opening 
every time a new aggregate is added.
   
   Either way, sliding-frame semantics need consideration for the 
non-invertible aggregates (`collect_*`, the variance/covariance family), since 
DataFusion may have to recompute each frame from scratch. Spark behaves the 
same way, so correctness shouldn't be at risk — performance may be.
   
   ### Additional context
   
   Related:
   - #4007 (closed) enabled the current window-aggregate allow-list
   - #4731 — `AVG(decimal)` window fallback (same class of gap, different 
aggregate)
   - #4724 — multi-stage `collect_list` / `collect_set` (batch path, not 
windows)
   
   This issue tracks the **window** side specifically.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Extend windowed aggregate allow-list (collect_set, collect_list, and other natively-supported aggregates) [datafusion-comet]

Reply via email to