andygrove opened a new pull request, #4542: URL: https://github.com/apache/datafusion-comet/pull/4542
## Which issue does this PR close? Part of #3190. ## Rationale for this change Comet had no native percentile aggregate, so `percentile(...)` (and the ANSI `percentile_cont(...) WITHIN GROUP`, which Spark rewrites to `Percentile`) always fell back to Spark. Codegen dispatch is not an option here: `Percentile` is a `TypedImperativeAggregate`, and the codegen dispatcher is a per-row scalar kernel that explicitly cannot run aggregates. So the only paths are native or fall back, and this PR wires it natively. DataFusion's `percentile_cont` computes the percentile with `index = p * (n - 1)` and linear interpolation between the two closest ranks, which is exactly Spark's exact `Percentile` algorithm. So the common single-percentage form matches Spark. ## What changes are included in this PR? - proto: new `Percentile` `AggExpr` message (`child`, `percentage`, `datatype`). - native planner (`planner.rs`): map `AggExprStruct::Percentile` to `percentile_cont_udaf()` with args `[child, percentile]`. - `CometPercentile` serde: `Compatible` for a single literal double percentage, default frequency, and numeric input. The child is cast to double so the native result is `DoubleType`, matching Spark. - `operators.adjustOutputForNativeState`: map Percentile's `TypedImperativeAggregate` `Binary` partial buffer to the native `List<Float64>` state (`ArrayType(DoubleType)`), mirroring the existing `CollectSet` handling, so the partial/shuffle/final exchange schema is correct. Out of scope (fall back to Spark): an array of percentages, a non-default frequency argument, and interval inputs. `approx_percentile` is deliberately not included (t-digest vs Spark's GK algorithm; tracked separately under #3189). Known minor caveat: DataFusion quantizes the interpolation fraction to 6 decimal places, so a deeply-interpolated value could in principle differ from Spark in the last ULPs. The tested percentiles match exactly; if needed this can be revisited with a custom accumulator. ## How are these changes tested? A SQL file test (`expressions/aggregate/percentile.sql`) run by `CometSqlFileTestSuite` covers global, grouped, integer-input, all-null-group, and exact and interpolated percentiles, asserting answer parity and native execution via `checkSparkAnswerAndOperator`. It also asserts that the array-of-percentages and frequency-argument forms fall back to Spark. The full SQL suite shows no new regressions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
