andygrove opened a new pull request, #4254:
URL: https://github.com/apache/datafusion-comet/pull/4254
## Which issue does this PR close?
Closes #1627.
## Rationale for this change
DataFusion's grouped aggregate operator has a fast path that uses
`GroupsAccumulator` instead of one `Accumulator` per group. Comet's
existing variance, stddev, covariance, and correlation aggregates only
implement the per-row `Accumulator` trait, so grouped queries fall back
to a slower row-at-a-time path. Issue #1275 measured a roughly 4x
speedup from applying the same change to `avg`. After this PR the four
remaining Comet-owned aggregates are on parity with `avg` / `sum_int`
/ `sum_decimal` / `avg_decimal`, which already implement the fast path.
## What changes are included in this PR?
- New `agg_funcs/welford.rs` with the Welford update/merge math used by
both the per-row and grouped paths. The existing `VarianceAccumulator`
and `CovarianceAccumulator` per-row methods now route through these
helpers (no behavior change).
- New `VarianceGroupsAccumulator`, `StddevGroupsAccumulator`,
`CovarianceGroupsAccumulator`, and `CorrelationGroupsAccumulator`.
Each `AggregateUDFImpl` now returns `true` from
`groups_accumulator_supported` and constructs the matching grouped
accumulator. Stddev wraps variance; correlation wraps one covariance
plus two variance group accumulators, mirroring the existing
`Accumulator` composition.
- The grouped accumulators preserve Spark wire format (`f64` counts) and
Spark semantics (`null_on_divide_by_zero` for the `count == 1` branches).
## How are these changes tested?
- New Rust unit tests cover single-group correctness, multi-group
correctness, null handling on each input column (correlation needs
both columns non-null per row), `opt_filter` masking, multi-batch
merge equals single-shot, and the empty-group-yields-null and
single-row-sample edge cases.
- The grouped paths are picked up automatically by DataFusion's hash
aggregate, so existing JVM coverage in `CometAggregateSuite` exercises
the new code end-to-end. CI runs the full sweep across Spark 3.4,
3.5, and 4.0.
This PR was scaffolded with the project's superpowers:brainstorming and
superpowers:writing-plans skills.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]