andygrove commented on issue #4552:
URL: 
https://github.com/apache/datafusion-comet/issues/4552#issuecomment-4608868039

   ## Investigation: native wire-up via DataFusion's `regr_*_udaf` does not 
work as-is
   
   Tried wiring DataFusion's `regr_slope_udaf`, `regr_intercept_udaf`, 
`regr_r2_udaf`, `regr_sxx_udaf`, `regr_syy_udaf`, `regr_sxy_udaf` (all present 
in `datafusion-functions-aggregate` 53.1) through a new `Regr` proto message + 
Scala serdes + planner match arm. Hit a hard wall at the JNI handoff:
   
   ```
   Comet Internal Error: Output column count mismatch: expected 3, got 6
     at native/core/src/execution/jni_api.rs:637
   ```
   
   `expected 3` is what JVM allocates from Spark's partial-aggregate buffer 
schema; `got 6` is what DataFusion's regr accumulator emits as its state. They 
don't agree, and they can't — Spark's and DataFusion's regr accumulators were 
designed independently.
   
   ### Why each function fails the same way
   
   After Spark's analyzer rewrites:
   
   | Spark expr at plan time | Buffer fields | DataFusion `regr_*_udaf` state 
fields |
   |---|---|---|
   | `RegrReplacement` (← `regr_sxx`/`regr_syy`) — extends `CentralMomentAgg` 
momentOrder=2 | 3 (`n, avg, m2`) | 6 |
   | `RegrSXY` extends `Covariance(y, x, true)` | 4 (`n, xAvg, yAvg, ck`) | 6 |
   | `RegrR2` extends `PearsonCorrelation(y, x, true)` | 6 (`n, xAvg, yAvg, ck, 
xMk, yMk`) | 6 |
   | `RegrSlope` (`DeclarativeAggregate` of `CovPopulation` + `VariancePop`) | 
7 | 6 |
   | `RegrIntercept` (same) | 7 | 6 |
   
   The first three (`regr_count`/`regr_avgx`/`regr_avgy`) already work because 
Spark lowers them to `Count`/`Average`, which Comet's existing UDAFs already 
match.
   
   ### Why codegen dispatch is not an option
   
   `CometScalaUDF.scala:40-44` is explicit: aggregate UDFs (`ScalaAggregator`, 
`TypedImperativeAggregate`, legacy UDAF) are not covered by the dispatcher. The 
dispatcher emits Spark's `doGenCode` as a per-batch Janino kernel — stateless. 
Aggregates have buffers that span batches and split into partial/merge/final 
phases, which the dispatcher doesn't model.
   
   ### The path that does work
   
   Custom Comet UDAFs under `native/spark-expr/src/agg_funcs/` whose 
`state_fields()` match Spark's buffer schema exactly. This is exactly what 
Comet already does for `Covariance`, `Correlation`, `Variance`, `Stddev` — and 
the regr_* family piggy-backs cleanly on those existing accumulators per 
linearRegression.scala:
   
   - `regr_sxy` reuses `Covariance`'s 4-field buffer; only the final eval 
changes (`If(n === 0, null, ck)` vs. `ck/(n-1)` or `ck/n`).
   - `regr_r2` reuses `PearsonCorrelation`'s 6-field buffer; final eval is 
`If(xMk === 0, null, If(yMk === 0, 1.0, (ck*ck)/(xMk*yMk)))`.
   - `regr_sxx`/`regr_syy` reuse `CentralMomentAgg`'s 3-field buffer (after 
recovering `(y, x)` from the Spark-side `RegrReplacement(If(IsNull(y) ∨ 
IsNull(x), null, x_or_y))` rewrite shape — the inner `branch` distinguishes SXX 
from SYY); final eval returns raw `m2`.
   - `regr_slope`/`regr_intercept` need a new combined-buffer accumulator: 7 
fields = 4 (cov: `n, xAvg, yAvg, ck`) + 3 (var: `n_v, avg_v, m2_v`), with the 
var-update gated behind the same null-pair check Spark applies.
   
   ### Suggested incremental order
   
   Mirrors the issue's "could be tackled incrementally":
   
   1. **`regr_sxy`** + **`regr_r2`** — single-mode extensions to `Covariance` / 
`Correlation` UDAFs; tiny diffs.
   2. **`regr_sxx`** + **`regr_syy`** — extend `Variance` UDAF with a "raw m2" 
eval mode + Scala-side pattern match on `RegrReplacement` to recover `(y, x)`.
   3. **`regr_slope`** + **`regr_intercept`** — net-new combined-buffer UDAF.
   
   ### Edge case to mind
   
   DataFusion's `regr_r2` returns `null` when `var(y) = 0`; Spark returns 
`1.0`. The `regr.sql` test data doesn't exercise this case, but the custom 
Comet UDAF must mirror Spark's branch (`yMk === 0 → 1.0`), not DataFusion's.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to