sesteves opened a new pull request, #19941: URL: https://github.com/apache/datafusion/pull/19941
The approx_percentile_cont_with_weight function was producing incorrect results due to wrong weight handling in the TDigest implementation. Root cause: In TDigest::new_with_centroid(), the count field was hardcoded to 1 regardless of the actual centroid weight, while the weight was correctly used in the sum calculation. This mismatch caused incorrect percentile calculations since estimate_quantile() uses count to compute the rank. Changes: - Changed TDigest::count from u64 to f64 to properly support fractional weights (consistent with ClickHouse's TDigest implementation) - Fixed new_with_centroid() to use centroid.weight for count - Updated state_fields() in approx_percentile_cont and approx_median to use Float64 for the count field - Added early return in merge_digests() when all centroids have zero weight to prevent panic - Updated test expectations to reflect correct weighted percentile behavior ## Which issue does this PR close? - Closes #19940 ## Rationale for this change The `approx_percentile_cont_with_weight` function produces incorrect weighted percentile results. The bug is in the TDigest implementation where `new_with_centroid()` sets `count: 1` regardless of the actual centroid weight, while the weight is used elsewhere in centroid merging. This mismatch corrupts the percentile calculation. ## What changes are included in this PR? - Changed `TDigest::count` from `u64` to `f64` to properly support fractional weights (consistent with [ClickHouse's TDigest implementation](https://github.com/ClickHouse/ClickHouse/blob/927af1255adb37ace1b95cc3ec4316553b4cb4b4/src/AggregateFunctions/QuantileTDigest.h#L71-L87)) - Fixed `new_with_centroid()` to use `centroid.weight` for count - Updated `state_fields()` in `approx_percentile_cont` and `approx_median` to use `Float64` for the count field - Added early return in `merge_digests()` when all centroids have zero weight to prevent panic - Updated test expectations to reflect correct weighted percentile behavior ## Are these changes tested? Yes. - All existing unit tests in tdigest.rs pass (7 tests) - All SQL logic tests for aggregate functions pass - Manual testing confirms correct behavior with various weight distributions (equal weights, heavy low/high values, linear weights, fractional weights) ## Are there any user-facing changes? Yes, this is a breaking change: 1. Result changes: approx_percentile_cont_with_weight now returns correct weighted percentiles. Queries relying on the previous (incorrect) behavior will see different results. 2. Serialized state format change: The TDigest state field count changes from UInt64 to Float64. Any existing serialized/checkpointed TDigest state will be incompatible and cannot be restored. 3. Edge case behavior change: When all weights are zero, the function now returns NULL instead of the previous undefined behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
