[PR] feat: add GroupsAccumulator for variance, stddev, covariance, correlation [datafusion-comet]

via GitHub Wed, 06 May 2026 16:12:06 -0700


andygrove opened a new pull request, #4254:
URL: https://github.com/apache/datafusion-comet/pull/4254


   ## Which issue does this PR close?
   
   Closes #1627.
   
   ## Rationale for this change
   
   DataFusion's grouped aggregate operator has a fast path that uses
   `GroupsAccumulator` instead of one `Accumulator` per group. Comet's
   existing variance, stddev, covariance, and correlation aggregates only
   implement the per-row `Accumulator` trait, so grouped queries fall back
   to a slower row-at-a-time path. Issue #1275 measured a roughly 4x
   speedup from applying the same change to `avg`. After this PR the four
   remaining Comet-owned aggregates are on parity with `avg` / `sum_int`
   / `sum_decimal` / `avg_decimal`, which already implement the fast path.
   
   ## What changes are included in this PR?
   
   - New `agg_funcs/welford.rs` with the Welford update/merge math used by
     both the per-row and grouped paths. The existing `VarianceAccumulator`
     and `CovarianceAccumulator` per-row methods now route through these
     helpers (no behavior change).
   - New `VarianceGroupsAccumulator`, `StddevGroupsAccumulator`,
     `CovarianceGroupsAccumulator`, and `CorrelationGroupsAccumulator`.
     Each `AggregateUDFImpl` now returns `true` from
     `groups_accumulator_supported` and constructs the matching grouped
     accumulator. Stddev wraps variance; correlation wraps one covariance
     plus two variance group accumulators, mirroring the existing
     `Accumulator` composition.
   - The grouped accumulators preserve Spark wire format (`f64` counts) and
     Spark semantics (`null_on_divide_by_zero` for the `count == 1` branches).
   
   ## How are these changes tested?
   
   - New Rust unit tests cover single-group correctness, multi-group
     correctness, null handling on each input column (correlation needs
     both columns non-null per row), `opt_filter` masking, multi-batch
     merge equals single-shot, and the empty-group-yields-null and
     single-row-sample edge cases.
   - The grouped paths are picked up automatically by DataFusion's hash
     aggregate, so existing JVM coverage in `CometAggregateSuite` exercises
     the new code end-to-end. CI runs the full sweep across Spark 3.4,
     3.5, and 4.0.
   
   This PR was scaffolded with the project's superpowers:brainstorming and
   superpowers:writing-plans skills.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] feat: add GroupsAccumulator for variance, stddev, covariance, correlation [datafusion-comet]

Reply via email to