andygrove opened a new issue, #4766: URL: https://github.com/apache/datafusion-comet/issues/4766
### What is the problem the feature request solves? The `CometWindowExec` translator carries a hard-coded allow-list of aggregate functions that may run inside a window: https://github.com/apache/datafusion-comet/blob/main/spark/src/main/scala/org/apache/spark/sql/comet/CometWindowExec.scala#L100-L141 Today it accepts `Count`, `Min`, `Max`, `Sum`, `Average`, `First`, `Last`. Anything else hits the catch-all and tags a fallback: > aggregate \`collect_set(...)\` is not supported for window function Several aggregates that Comet *does* support in batch mode are excluded by this list, so queries that wrap them in `OVER (...)` fall back to Spark even when DataFusion's window machinery could execute them natively. Examples I've hit or expect to hit: - `collect_set` / `collect_list` — fully supported as batch aggregates (#3951, #2525) - `StddevSamp`, `StddevPop`, `VarSamp`, `VarPop` - `CorrPearson`, `Covariance` - `BitAnd`, `BitOr`, `BitXor` - `BloomFilterAggregate` ### Describe the potential solution Two paths, in increasing scope: 1. **Targeted whitelist additions** — add cases to the match in `CometWindowExec.scala:100` for each aggregate where DataFusion's window operator can already evaluate it, gated on the relevant data-type checks (the existing `Sum`/`Average`/`Min`/`Max` arms already follow this pattern). `collect_set` and `collect_list` are the cheapest to land first because they have no sliding-frame subtleties when the frame is unbounded. 2. **Generalize dispatch** — replace the allow-list with a single check that delegates to the same `AggSerde` path used by the batch aggregate operator, and let DataFusion reject anything it can't run. This stops the gap from re-opening every time a new aggregate is added. Either way, sliding-frame semantics need consideration for the non-invertible aggregates (`collect_*`, the variance/covariance family), since DataFusion may have to recompute each frame from scratch. Spark behaves the same way, so correctness shouldn't be at risk — performance may be. ### Additional context Related: - #4007 (closed) enabled the current window-aggregate allow-list - #4731 — `AVG(decimal)` window fallback (same class of gap, different aggregate) - #4724 — multi-stage `collect_list` / `collect_set` (batch path, not windows) This issue tracks the **window** side specifically. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
