sesteves opened a new pull request, #19941:
URL: https://github.com/apache/datafusion/pull/19941

   The approx_percentile_cont_with_weight function was producing incorrect 
results due to wrong weight handling in the TDigest implementation.
   
   Root cause: In TDigest::new_with_centroid(), the count field was hardcoded 
to 1 regardless of the actual centroid weight, while the weight was correctly 
used in the sum calculation. This mismatch caused incorrect percentile 
calculations since estimate_quantile() uses count to compute the rank.
   
   Changes:
   - Changed TDigest::count from u64 to f64 to properly support fractional 
weights (consistent with ClickHouse's TDigest implementation)
   - Fixed new_with_centroid() to use centroid.weight for count
   - Updated state_fields() in approx_percentile_cont and approx_median to use 
Float64 for the count field
   - Added early return in merge_digests() when all centroids have zero weight 
to prevent panic
   - Updated test expectations to reflect correct weighted percentile behavior
   
   ## Which issue does this PR close?
   
   - Closes #19940
   
   ## Rationale for this change
   
   The `approx_percentile_cont_with_weight` function produces incorrect 
weighted percentile results. The bug is in the TDigest implementation where 
`new_with_centroid()` sets `count: 1` regardless of the actual centroid weight, 
while the weight is used elsewhere in centroid merging. This mismatch corrupts 
the percentile calculation.
   
   ## What changes are included in this PR?
   
   - Changed `TDigest::count` from `u64` to `f64` to properly support 
fractional weights (consistent with [ClickHouse's TDigest 
implementation](https://github.com/ClickHouse/ClickHouse/blob/927af1255adb37ace1b95cc3ec4316553b4cb4b4/src/AggregateFunctions/QuantileTDigest.h#L71-L87))
   
   - Fixed `new_with_centroid()` to use `centroid.weight` for count
   - Updated `state_fields()` in `approx_percentile_cont` and `approx_median` 
to use `Float64` for the count field
   - Added early return in `merge_digests()` when all centroids have zero 
weight to prevent panic
   - Updated test expectations to reflect correct weighted percentile behavior  
  
   
   ## Are these changes tested?
   
   Yes. 
   - All existing unit tests in tdigest.rs pass (7 tests)
   - All SQL logic tests for aggregate functions pass
   - Manual testing confirms correct behavior with various weight distributions 
(equal weights, heavy low/high values, linear weights, fractional weights)
    
   ## Are there any user-facing changes?
   
   Yes, this is a breaking change:
   1. Result changes: approx_percentile_cont_with_weight now returns correct 
weighted percentiles. Queries relying on the previous (incorrect) behavior will 
see different results.
   2. Serialized state format change: The TDigest state field count changes 
from UInt64 to Float64. Any existing serialized/checkpointed TDigest state will 
be incompatible and cannot be restored.
   3. Edge case behavior change: When all weights are zero, the function now 
returns NULL instead of the previous undefined behavior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to