[PR] feat: support exact percentile aggregate natively [datafusion-comet]

via GitHub Sat, 30 May 2026 11:32:41 -0700


andygrove opened a new pull request, #4542:
URL: https://github.com/apache/datafusion-comet/pull/4542


   ## Which issue does this PR close?
   
   Part of #3190.
   
   ## Rationale for this change
   
   Comet had no native percentile aggregate, so `percentile(...)` (and the ANSI 
`percentile_cont(...) WITHIN GROUP`, which Spark rewrites to `Percentile`) 
always fell back to Spark. Codegen dispatch is not an option here: `Percentile` 
is a `TypedImperativeAggregate`, and the codegen dispatcher is a per-row scalar 
kernel that explicitly cannot run aggregates. That leaves two options, native 
execution or fallback to Spark, and this PR implements the native path.
   
   DataFusion's `percentile_cont` computes the percentile with 
`index = p * (n - 1)` and linear interpolation between the two closest ranks, 
which is the same algorithm Spark uses for its exact `Percentile`. The common 
single-percentage form therefore matches Spark.
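
For reference, the formula above can be sketched as a standalone function. This is a minimal illustration of `index = p * (n - 1)` with linear interpolation, not the DataFusion or Spark source; the name `exact_percentile` and the `Option` return convention are assumptions made for this example.

```rust
/// Exact percentile with linear interpolation between the two closest ranks.
/// Hypothetical helper for illustration only; not code from this PR.
/// Returns `None` for empty input or when `p` is outside [0, 1] (or NaN).
fn exact_percentile(values: &[f64], p: f64) -> Option<f64> {
    if values.is_empty() || !(0.0..=1.0).contains(&p) {
        return None;
    }
    // Work on a sorted copy so the caller's slice is left untouched.
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Fractional rank: index = p * (n - 1).
    let index = p * (sorted.len() - 1) as f64;
    let lower = index.floor() as usize;
    let upper = index.ceil() as usize;
    if lower == upper {
        // The rank falls exactly on an element, so no interpolation is needed.
        return Some(sorted[lower]);
    }
    // Interpolate linearly between the two neighbouring ranks.
    let fraction = index - lower as f64;
    Some(sorted[lower] + (sorted[upper] - sorted[lower]) * fraction)
}
```

For the input `[1.0, 2.0, 3.0, 4.0]` and `p = 0.5`, the index is `1.5`, so the result is interpolated halfway between `2.0` and `3.0`, giving `2.5`.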
   
   ## What changes are included in this PR?
   
   - proto: new `Percentile` `AggExpr` message (`child`, `percentage`, 
`datatype`).
   - native planner (`planner.rs`): map `AggExprStruct::Percentile` to 
`percentile_cont_udaf()` with args `[child, percentile]`.
   - `CometPercentile` serde: `Compatible` for a single literal double 
percentage, default frequency, and numeric input. The child is cast to double 
so the native result is `DoubleType`, matching Spark.
   - `operators.adjustOutputForNativeState`: map Percentile's 
`TypedImperativeAggregate` `Binary` partial buffer to the native 
`List<Float64>` state (`ArrayType(DoubleType)`), mirroring the existing 
`CollectSet` handling, so the partial/shuffle/final exchange schema is correct.
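
To illustrate why the partial buffer is a list of doubles, here is a rough sketch of a two-phase exact percentile. Each partial aggregate only collects its input values, the lists are concatenated after the exchange, and the final stage sorts and interpolates. The `PercentileState` type and its methods are hypothetical and exist only for this example; they are not DataFusion's accumulator API or the code in this PR.

```rust
/// Hypothetical partial-aggregation state: the collected inputs as doubles,
/// analogous in shape to a `List<Float64>` state column.
#[derive(Debug, Default, Clone, PartialEq)]
struct PercentileState {
    values: Vec<f64>,
}

impl PercentileState {
    /// Partial stage: record one input row, skipping nulls (modeled as `None`).
    fn update(&mut self, value: Option<f64>) {
        if let Some(v) = value {
            self.values.push(v);
        }
    }

    /// Final stage input: merge the state produced by another partition.
    fn merge(&mut self, other: &PercentileState) {
        self.values.extend_from_slice(&other.values);
    }

    /// Final stage: sort all collected values and interpolate.
    /// Returns `None` when no non-null value was seen (an assumption of
    /// this sketch about how an all-null group is represented).
    fn evaluate(&self, p: f64) -> Option<f64> {
        if self.values.is_empty() {
            return None;
        }
        let mut sorted = self.values.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));
        let index = p * (sorted.len() - 1) as f64;
        let lower = index.floor() as usize;
        let upper = index.ceil() as usize;
        let fraction = index - lower as f64;
        Some(sorted[lower] + (sorted[upper] - sorted[lower]) * fraction)
    }
}
```

Because the exact result needs every value, the state cannot be reduced to a fixed-size summary, which is why the exchange schema has to carry a variable-length list rather than an opaque binary blob that only Spark knows how to decode.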
   
   Out of scope (fall back to Spark): an array of percentages, a non-default 
frequency argument, and interval inputs. `approx_percentile` is deliberately 
not included (t-digest vs Spark's GK algorithm; tracked separately under #3189).
   
   Known minor caveat: DataFusion quantizes the interpolation fraction to 6 
decimal places, so a deeply-interpolated value could in principle differ from 
Spark in the last ULPs. The tested percentiles match exactly; if needed this 
can be revisited with a custom accumulator.
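
The effect of quantizing the fraction can be sketched in isolation. The code below assumes that "6 decimal places" means truncating the fraction to a multiple of `1e-6` before interpolating; that is an assumption made for illustration and may not match DataFusion's actual arithmetic. Both function names are hypothetical.

```rust
/// Linear interpolation using the full-precision fraction.
fn interpolate(lower: f64, upper: f64, fraction: f64) -> f64 {
    lower + (upper - lower) * fraction
}

/// Linear interpolation with the fraction reduced to 6 decimal places.
/// ASSUMPTION: quantization is modeled as truncation to a multiple of 1e-6;
/// the real implementation may round or scale differently.
fn interpolate_quantized(lower: f64, upper: f64, fraction: f64) -> f64 {
    const STEPS: f64 = 1_000_000.0;
    let quantized = (fraction * STEPS).floor() / STEPS;
    lower + (upper - lower) * quantized
}
```

A fraction that is already representable in six decimal places, such as `0.5` or `0.25`, passes through unchanged, so both functions agree. A fraction with more digits is where the two can diverge under this model.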
   
   ## How are these changes tested?
   
   A SQL file test (`expressions/aggregate/percentile.sql`) run by 
`CometSqlFileTestSuite` covers global, grouped, integer-input, and 
all-null-group aggregation with both exact and interpolated percentiles, 
asserting answer parity and native execution via 
`checkSparkAnswerAndOperator`. It also asserts that the array-of-percentages 
and frequency-argument forms fall back to Spark. The full SQL suite shows no 
new regressions.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

