[I] approx_percentile_cont_with_weight produces incorrect results due to wrong weight handling in TDigest [datafusion]

via GitHub Thu, 22 Jan 2026 08:35:01 -0800


sesteves opened a new issue, #19940:
URL: https://github.com/apache/datafusion/issues/19940


   ### Describe the bug
   
   The approx_percentile_cont_with_weight function produces incorrect weighted 
percentile results. The bug is in the TDigest implementation where 
new_with_centroid() sets count: 1 regardless of the actual centroid weight, 
while the weight is used elsewhere in centroid merging. This mismatch corrupts 
the percentile calculation.
   
   **Root cause**
   
   In datafusion/functions-aggregate-common/src/tdigest.rs, the 
new_with_centroid function incorrectly hardcodes count: 1:
   pub fn new_with_centroid(max_size: usize, centroid: Centroid) -> Self {
       TDigest {
           centroids: vec![centroid.clone()],
           max_size,
           sum: centroid.mean * centroid.weight,
           count: 1,  // BUG: should be centroid.weight
           max: centroid.mean,
           min: centroid.mean,
       }
   }
   The count field should reflect the actual weight of the centroid, as it's 
used in estimate_quantile() to compute the rank: let rank = q * self.count. 
Since the weight is correctly used in sum but not in count, the algorithm 
produces inconsistent results.
   Additionally, count should be f64 (not u64) to properly support fractional 
weights, consistent with the ClickHouse TDigest implementation 
(https://github.com/ClickHouse/ClickHouse/blob/927af1255adb37ace1b95cc3ec4316553b4cb4b4/src/AggregateFunctions/QuantileTDigest.h#L71-L87).
   
   ### To Reproduce
   
   -- Weighted percentile with increasing weights (1,2,3,...,9)
   -- Higher values have higher weights, so weighted median should be > 5
   SELECT approx_percentile_cont_with_weight(value, weight, 0.5) as weighted
   FROM (VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6,6), (7,7), (8,8), 
(9,9)) AS t(value, weight);
   -- Compare with unweighted percentile
   SELECT approx_percentile_cont(value, 0.5) as unweighted
   FROM (VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6,6), (7,7), (8,8), 
(9,9)) AS t(value, weight);
   
   **Actual Behavior**
   
   weighted:   3
   unweighted: 5
   The weighted result (3) is lower than the unweighted median (5), even though 
higher values have proportionally higher weights. The result is neither correct 
weighted percentile nor correct unweighted percentile.
   
   ### Expected behavior
   
   weighted:   6 or 7
   unweighted: 5
   With weights (1,2,3,4,5,6,7,8,9):
   - Total weight = 45
   - At 0.5 quantile, rank = 22.5
   - Cumulative weights: 1→1, 2→3, 3→6, 4→10, 5→15, 6→21, 7→28
   - Rank 22.5 falls between value 6 (cumulative 21) and value 7 (cumulative 28)
   - Expected weighted median ≈ 6-7
   
   The weighted median should be pulled toward higher values since they have 
higher weights.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] approx_percentile_cont_with_weight produces incorrect results due to wrong weight handling in TDigest [datafusion]

Reply via email to