andygrove opened a new issue, #4552: URL: https://github.com/apache/datafusion-comet/issues/4552
## Background

Spark provides the SQL standard linear-regression aggregate functions `regr_*(y, x)`, which compute single-pass statistics over rows where both `y` and `x` are non-null. They are standard descriptive statistics (same set as PostgreSQL), not ML.

SQL file tests added in #4551 (`spark/src/test/resources/sql-tests/expressions/aggregate/regr.sql`) establish the current state empirically:

- **Already accelerated natively:** `regr_count`, `regr_avgx`, `regr_avgy`. Spark implements these as `RuntimeReplaceableAggregate`s that lower to `Count` / `Average`, which Comet already supports, so they run without any new code.
- **Currently fall back to Spark (this issue):** `regr_sxx`, `regr_syy`, `regr_sxy`, `regr_slope`, `regr_intercept`, `regr_r2`. In #4551 these are covered with `query spark_answer_only` (correctness only).

## Proposal

Add native Comet support for the six functions that currently fall back. None of these are greenfield: they all build on the streaming moment accumulators Comet already implements for `covar_pop`, `var_pop`, and `corr`.

From Spark's `linearRegression.scala`:

- `regr_sxy` extends `Covariance` (same accumulator as Comet's `covar_pop`/`covar_samp`), with a different final expression.
- `regr_r2` extends `PearsonCorrelation` (same accumulator as Comet's `corr`), with a different final expression.
- `regr_slope` and `regr_intercept` are `DeclarativeAggregate`s composing `CovPopulation` + `VariancePop`, with a null-pair guard on the variance update.
- `regr_sxx` and `regr_syy` are `RuntimeReplaceableAggregate`s that lower to an internal `RegrReplacement` declarative aggregate (count times variance over the non-null pairs).

So the work is to wire these aggregate classes through `QueryPlanSerde` / the aggregate serde and expose the appropriate final expressions over the existing native accumulators, plus match Spark's null-pair filtering semantics exactly.
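To make the "shared accumulator, different final expression" idea concrete, here is a minimal Python sketch. It is not Comet's or Spark's code: the `PairMoments` class and its method names are hypothetical, and the final expressions follow the SQL-standard definitions (for example `regr_sxx = regr_count * var_pop(x)`). The null and zero-variance handling shown is an assumption and would need to be checked against `linearRegression.scala`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PairMoments:
    """Hypothetical streaming co-moment accumulator over non-null (y, x) pairs."""
    n: int = 0
    mean_x: float = 0.0
    mean_y: float = 0.0
    m2_x: float = 0.0  # running sum of (x - mean_x)^2
    m2_y: float = 0.0  # running sum of (y - mean_y)^2
    c_xy: float = 0.0  # running sum of (x - mean_x) * (y - mean_y)

    def update(self, y: Optional[float], x: Optional[float]) -> None:
        # Null-pair filter: a row contributes only if BOTH values are present.
        if y is None or x is None:
            return
        self.n += 1
        dx = x - self.mean_x  # deltas against the means before this row
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Welford-style updates: old delta times new delta.
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    # Final expressions: all six read the same accumulator state.
    def regr_sxx(self) -> Optional[float]:
        return self.m2_x if self.n > 0 else None

    def regr_syy(self) -> Optional[float]:
        return self.m2_y if self.n > 0 else None

    def regr_sxy(self) -> Optional[float]:
        return self.c_xy if self.n > 0 else None

    def regr_slope(self) -> Optional[float]:
        # Assumed edge case: undefined when x has no variance.
        if self.n == 0 or self.m2_x == 0.0:
            return None
        return self.c_xy / self.m2_x

    def regr_intercept(self) -> Optional[float]:
        slope = self.regr_slope()
        if slope is None:
            return None
        return self.mean_y - slope * self.mean_x

    def regr_r2(self) -> Optional[float]:
        # Assumed edge cases: null if x is constant, 1.0 if only y is constant.
        if self.n == 0 or self.m2_x == 0.0:
            return None
        if self.m2_y == 0.0:
            return 1.0
        return (self.c_xy * self.c_xy) / (self.m2_x * self.m2_y)


def aggregate(rows):
    """Fold an iterable of (y, x) rows into one accumulator."""
    acc = PairMoments()
    for y, x in rows:
        acc.update(y, x)
    return acc
```

Calling `aggregate([(3.0, 1.0), (None, 2.0), (5.0, 2.0)])` skips the middle row, because its `y` is null, and accumulates the other two. The point of the sketch is that one pass and one accumulator state serve every `regr_*` result, so only the null-pair guard and the final expression differ per function.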
## Acceptance criteria

- The six functions execute natively in Comet and match Spark.
- Update `expressions/aggregate/regr.sql` (from #4551) to switch these queries from `query spark_answer_only` back to the default `query` mode so native execution is asserted.

## Notes

These could be tackled incrementally (for example `regr_sxy` and `regr_r2` first, since they map most directly onto the existing covariance and correlation accumulators).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
