andygrove opened a new issue, #4719: URL: https://github.com/apache/datafusion-comet/issues/4719
### Describe the bug

Comet's native `percentile` aggregate (PR #4542) maps to DataFusion's `percentile_cont`, which computes the linear interpolation weight with a quantization step:

```rust
const INTERPOLATION_PRECISION: f64 = 1_000_000.0;
let fraction = index - (lower_index as f64);
let scaled = (fraction * INTERPOLATION_PRECISION) as usize;
let weight = scaled as f64 / INTERPOLATION_PRECISION;
let interpolated_f = lower_f + (upper_f - lower_f) * weight;
```

The interpolation weight is truncated to 6 decimal places. Spark's exact `Percentile` interpolates with the full-precision fraction (`(position - lower) * higherValue + (higher - position) * lowerValue`), so a deeply-interpolated value can differ from Spark by up to roughly `(upper - lower) * 1e-6`.

### Affected versions

Spark 3.4 / 3.5 / 4.0 / 4.1, wherever `percentile(col, p)` (or `median`, or `percentile_cont ... WITHIN GROUP`) maps to the native path.

### Impact

Minor. The difference only appears when `p * (n - 1)` has a fractional part not representable in 6 decimal places, and is bounded by `(upper - lower) * 1e-6`. The cases tested in `percentile.sql` match Spark exactly.

### Possible fix

Either contribute a higher-precision (or unquantized) interpolation upstream to DataFusion's `percentile_cont`, or implement a Comet-specific accumulator that matches Spark's interpolation exactly.

Surfaced by the `percentile` audit accompanying #4542.
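To make the bound concrete, here is a minimal standalone sketch that compares the quantized weight against the full-precision form. It is not the DataFusion or Spark code: the function names `interpolate_quantized` and `interpolate_exact` and the sample values in `main` are hypothetical and chosen only for illustration. Only the arithmetic is taken from the snippet and formula quoted in this issue.

```rust
const INTERPOLATION_PRECISION: f64 = 1_000_000.0;

/// Hypothetical helper: interpolates with the weight truncated to six
/// decimal places, following the arithmetic of the snippet quoted above.
fn interpolate_quantized(lower_f: f64, upper_f: f64, index: f64) -> f64 {
    let lower_index = index.floor() as usize;
    let fraction = index - (lower_index as f64);
    // The `as usize` cast truncates toward zero, dropping digits past 1e-6.
    let scaled = (fraction * INTERPOLATION_PRECISION) as usize;
    let weight = scaled as f64 / INTERPOLATION_PRECISION;
    lower_f + (upper_f - lower_f) * weight
}

/// Hypothetical helper: interpolates with the full-precision fraction,
/// using the form quoted for Spark's `Percentile`.
fn interpolate_exact(lower_value: f64, higher_value: f64, position: f64) -> f64 {
    let lower = position.floor();
    let higher = position.ceil();
    if lower == higher {
        // The position falls exactly on an element; nothing to interpolate.
        return lower_value;
    }
    (position - lower) * higher_value + (higher - position) * lower_value
}

fn main() {
    // Illustrative assumption: n = 8 values and p = 1/3, so the position
    // p * (n - 1) = 7/3 has a fraction that does not fit in six decimals.
    let position = 7.0 / 3.0;
    let (lower_value, upper_value) = (0.0_f64, 1.0e9_f64);

    let quantized = interpolate_quantized(lower_value, upper_value, position);
    let exact = interpolate_exact(lower_value, upper_value, position);
    let bound = (upper_value - lower_value) * 1e-6;

    println!("quantized  = {quantized}");
    println!("exact      = {exact}");
    println!("difference = {}", (exact - quantized).abs());
    println!("bound      = {bound}");
}
```

With a widely spaced pair of neighbouring values the two results differ visibly but stay within `(upper - lower) * 1e-6`. For a position such as `2.5`, whose fraction is exactly representable in six decimals, both helpers agree.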
