andygrove opened a new issue, #4552:
URL: https://github.com/apache/datafusion-comet/issues/4552

   ## Background
   
   Spark provides the SQL-standard linear-regression aggregate functions 
`regr_*(y, x)`, which compute single-pass statistics over rows where both `y` 
and `x` are non-null. They are standard descriptive statistics (the same set 
that PostgreSQL provides), not machine-learning functions.
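
For reference, the textbook definitions of the six functions that fall back can be written as below. This is a sketch based on the usual SQL-standard definitions, not a transcription of Spark's code: $N$ is the number of rows where both values are non-null, $\bar{x}$ and $\bar{y}$ are means over those rows, and the handling of degenerate inputs (zero rows, zero variance) is deliberately omitted because it has to be taken from Spark itself.

$$
\begin{aligned}
\mathrm{regr\_sxx} &= \sum_{i=1}^{N} (x_i - \bar{x})^2 = N \cdot \mathrm{var\_pop}(x) \\
\mathrm{regr\_syy} &= \sum_{i=1}^{N} (y_i - \bar{y})^2 = N \cdot \mathrm{var\_pop}(y) \\
\mathrm{regr\_sxy} &= \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) = N \cdot \mathrm{covar\_pop}(y, x) \\
\mathrm{regr\_slope} &= \frac{\mathrm{regr\_sxy}}{\mathrm{regr\_sxx}} \\
\mathrm{regr\_intercept} &= \bar{y} - \mathrm{regr\_slope} \cdot \bar{x} \\
\mathrm{regr\_r2} &= \frac{\mathrm{regr\_sxy}^2}{\mathrm{regr\_sxx} \cdot \mathrm{regr\_syy}}
\end{aligned}
$$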
   
   SQL file tests added in #4551 
(`spark/src/test/resources/sql-tests/expressions/aggregate/regr.sql`) establish 
the current state empirically:
   
   - **Already accelerated natively:** `regr_count`, `regr_avgx`, `regr_avgy`. 
Spark implements these as `RuntimeReplaceableAggregate`s that lower to `Count` 
/ `Average`, which Comet already supports, so they run without any new code.
   - **Currently fall back to Spark (this issue):** `regr_sxx`, `regr_syy`, 
`regr_sxy`, `regr_slope`, `regr_intercept`, `regr_r2`. In #4551 these are 
covered with `query spark_answer_only` (correctness only).
   
   ## Proposal
   
   Add native Comet support for the six functions that currently fall back. 
None of these are greenfield: they all build on the streaming moment 
accumulators Comet already implements for `covar_pop`, `var_pop`, and `corr`. 
From Spark's `linearRegression.scala`:
   
   - `regr_sxy` extends `Covariance` (same accumulator as Comet's 
`covar_pop`/`covar_samp`), with a different final expression.
   - `regr_r2` extends `PearsonCorrelation` (same accumulator as Comet's 
`corr`), with a different final expression.
   - `regr_slope` and `regr_intercept` are `DeclarativeAggregate`s composing 
`CovPopulation` + `VariancePop`, with a null-pair guard on the variance update.
   - `regr_sxx` and `regr_syy` are `RuntimeReplaceableAggregate`s that lower to 
an internal `RegrReplacement` declarative aggregate (count times population 
variance over the non-null pairs, i.e. the sum of squared deviations).
   
   So the work is to wire these aggregate classes through `QueryPlanSerde` / 
the aggregate serde and expose the appropriate final expressions over the 
existing native accumulators, plus match Spark's null-pair filtering semantics 
exactly.
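
To make the shape of the work concrete, here is a minimal reference sketch in Python of one shared streaming accumulator with null-pair filtering and the six final expressions on top of it. It is not Comet's or Spark's code: the names `Moments` and `regr_results` are hypothetical, and the degenerate-case rules (null on zero rows, null on zero `x` variance, `1.0` for `regr_r2` on zero `y` variance) are assumptions taken from the usual SQL-standard definitions that would need to be checked against `linearRegression.scala`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

# A (y, x) row; None stands in for SQL NULL.
Pair = Tuple[Optional[float], Optional[float]]


@dataclass
class Moments:
    """Single-pass (Welford-style) moments over rows where y and x are both non-null."""
    n: float = 0.0
    x_avg: float = 0.0
    y_avg: float = 0.0
    ck: float = 0.0    # running sum of (x - x_avg) * (y - y_avg)
    x_mk: float = 0.0  # running sum of (x - x_avg)^2
    y_mk: float = 0.0  # running sum of (y - y_avg)^2

    def update(self, y: Optional[float], x: Optional[float]) -> None:
        # Null-pair filtering: a row contributes only if BOTH values are present.
        if y is None or x is None:
            return
        self.n += 1.0
        dx = x - self.x_avg
        dy = y - self.y_avg
        self.x_avg += dx / self.n
        self.y_avg += dy / self.n
        # Each term pairs the old-mean delta with the new-mean delta.
        self.ck += dx * (y - self.y_avg)
        self.x_mk += dx * (x - self.x_avg)
        self.y_mk += dy * (y - self.y_avg)


def regr_results(pairs: Iterable[Pair]) -> Dict[str, Optional[float]]:
    """Evaluate the six final expressions over one shared accumulator."""
    m = Moments()
    for y, x in pairs:
        m.update(y, x)

    names = ["regr_sxx", "regr_syy", "regr_sxy",
             "regr_slope", "regr_intercept", "regr_r2"]
    if m.n == 0:
        # Assumption: every function is null when no non-null pair was seen.
        return {name: None for name in names}

    sxx, syy, sxy = m.x_mk, m.y_mk, m.ck
    # Assumption: slope, intercept and r2 are null when x has zero variance.
    slope = None if sxx == 0 else sxy / sxx
    intercept = None if slope is None else m.y_avg - slope * m.x_avg
    if sxx == 0:
        r2 = None
    elif syy == 0:
        r2 = 1.0  # Assumption: constant y with varying x is treated as a perfect fit.
    else:
        r2 = (sxy * sxy) / (sxx * syy)

    return dict(zip(names, [sxx, syy, sxy, slope, intercept, r2]))


if __name__ == "__main__":
    # y = 2x + 1 on three complete rows, plus two rows that must be ignored.
    rows = [(3.0, 1.0), (5.0, 2.0), (7.0, 3.0), (None, 4.0), (9.0, None)]
    print(regr_results(rows))
```

The point of the sketch is that all six results are cheap final expressions over the same six running values, so the native side mainly needs the null-pair guard on update and a per-function final expression rather than new accumulators.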
   
   ## Acceptance criteria
   
   - The six functions execute natively in Comet and match Spark.
   - Update `expressions/aggregate/regr.sql` (from #4551) to switch these 
queries from `query spark_answer_only` back to the default `query` mode so 
native execution is asserted.
   
   ## Notes
   
   These could be tackled incrementally (for example `regr_sxy` and `regr_r2` 
first, since they map most directly onto the existing covariance and 
correlation accumulators).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

