andygrove commented on issue #4552:
URL:
https://github.com/apache/datafusion-comet/issues/4552#issuecomment-4608868039
## Investigation: native wire-up via DataFusion's `regr_*_udaf` does not
work as-is
Tried wiring DataFusion's `regr_slope_udaf`, `regr_intercept_udaf`,
`regr_r2_udaf`, `regr_sxx_udaf`, `regr_syy_udaf`, `regr_sxy_udaf` (all present
in `datafusion-functions-aggregate` 53.1) through a new `Regr` proto message +
Scala serdes + planner match arm. Hit a hard wall at the JNI handoff:
```
Comet Internal Error: Output column count mismatch: expected 3, got 6
at native/core/src/execution/jni_api.rs:637
```
`expected 3` is what JVM allocates from Spark's partial-aggregate buffer
schema; `got 6` is what DataFusion's regr accumulator emits as its state. They
don't agree, and they can't — Spark's and DataFusion's regr accumulators were
designed independently.
### Why each function fails the same way
After Spark's analyzer rewrites:
| Spark expr at plan time | Buffer fields | DataFusion `regr_*_udaf` state
fields |
|---|---|---|
| `RegrReplacement` (← `regr_sxx`/`regr_syy`) — extends `CentralMomentAgg`
momentOrder=2 | 3 (`n, avg, m2`) | 6 |
| `RegrSXY` extends `Covariance(y, x, true)` | 4 (`n, xAvg, yAvg, ck`) | 6 |
| `RegrR2` extends `PearsonCorrelation(y, x, true)` | 6 (`n, xAvg, yAvg, ck,
xMk, yMk`) | 6 |
| `RegrSlope` (`DeclarativeAggregate` of `CovPopulation` + `VariancePop`) |
7 | 6 |
| `RegrIntercept` (same) | 7 | 6 |
The first three (`regr_count`/`regr_avgx`/`regr_avgy`) already work because
Spark lowers them to `Count`/`Average`, which Comet's existing UDAFs already
match.
### Why codegen dispatch is not an option
`CometScalaUDF.scala:40-44` is explicit: aggregate UDFs (`ScalaAggregator`,
`TypedImperativeAggregate`, legacy UDAF) are not covered by the dispatcher. The
dispatcher emits Spark's `doGenCode` as a per-batch Janino kernel — stateless.
Aggregates have buffers that span batches and split into partial/merge/final
phases, which the dispatcher doesn't model.
### The path that does work
Custom Comet UDAFs under `native/spark-expr/src/agg_funcs/` whose
`state_fields()` match Spark's buffer schema exactly. This is exactly what
Comet already does for `Covariance`, `Correlation`, `Variance`, `Stddev` — and
the regr_* family piggy-backs cleanly on those existing accumulators per
linearRegression.scala:
- `regr_sxy` reuses `Covariance`'s 4-field buffer; only the final eval
changes (`If(n === 0, null, ck)` vs. `ck/(n-1)` or `ck/n`).
- `regr_r2` reuses `PearsonCorrelation`'s 6-field buffer; final eval is
`If(xMk === 0, null, If(yMk === 0, 1.0, (ck*ck)/(xMk*yMk)))`.
- `regr_sxx`/`regr_syy` reuse `CentralMomentAgg`'s 3-field buffer (after
recovering `(y, x)` from the Spark-side `RegrReplacement(If(IsNull(y) ∨
IsNull(x), null, x_or_y))` rewrite shape — the inner `branch` distinguishes SXX
from SYY); final eval returns raw `m2`.
- `regr_slope`/`regr_intercept` need a new combined-buffer accumulator: 7
fields = 4 (cov: `n, xAvg, yAvg, ck`) + 3 (var: `n_v, avg_v, m2_v`), with the
var-update gated behind the same null-pair check Spark applies.
### Suggested incremental order
Mirrors the issue's "could be tackled incrementally":
1. **`regr_sxy`** + **`regr_r2`** — single-mode extensions to `Covariance` /
`Correlation` UDAFs; tiny diffs.
2. **`regr_sxx`** + **`regr_syy`** — extend `Variance` UDAF with a "raw m2"
eval mode + Scala-side pattern match on `RegrReplacement` to recover `(y, x)`.
3. **`regr_slope`** + **`regr_intercept`** — net-new combined-buffer UDAF.
### Edge case to mind
DataFusion's `regr_r2` returns `null` when `var(y) = 0`; Spark returns
`1.0`. The `regr.sql` test data doesn't exercise this case, but the custom
Comet UDAF must mirror Spark's branch (`yMk === 0 → 1.0`), not DataFusion's.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]