sesteves opened a new issue, #19940:
URL: https://github.com/apache/datafusion/issues/19940
### Describe the bug
The approx_percentile_cont_with_weight function produces incorrect weighted
percentile results. The bug is in the TDigest implementation where
new_with_centroid() sets count: 1 regardless of the actual centroid weight,
while the weight is used elsewhere in centroid merging. This mismatch corrupts
the percentile calculation.
**Root cause**
In datafusion/functions-aggregate-common/src/tdigest.rs, the
new_with_centroid function incorrectly hardcodes count: 1:
pub fn new_with_centroid(max_size: usize, centroid: Centroid) -> Self {
TDigest {
centroids: vec![centroid.clone()],
max_size,
sum: centroid.mean * centroid.weight,
count: 1, // BUG: should be centroid.weight
max: centroid.mean,
min: centroid.mean,
}
}
The count field should reflect the actual weight of the centroid, as it's
used in estimate_quantile() to compute the rank: let rank = q * self.count.
Since the weight is correctly used in sum but not in count, the algorithm
produces inconsistent results.
Additionally, count should be f64 (not u64) to properly support fractional
weights, consistent with the ClickHouse TDigest implementation
(https://github.com/ClickHouse/ClickHouse/blob/927af1255adb37ace1b95cc3ec4316553b4cb4b4/src/AggregateFunctions/QuantileTDigest.h#L71-L87).
### To Reproduce
-- Weighted percentile with increasing weights (1,2,3,...,9)
-- Higher values have higher weights, so weighted median should be > 5
SELECT approx_percentile_cont_with_weight(value, weight, 0.5) as weighted
FROM (VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6,6), (7,7), (8,8),
(9,9)) AS t(value, weight);
-- Compare with unweighted percentile
SELECT approx_percentile_cont(value, 0.5) as unweighted
FROM (VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6,6), (7,7), (8,8),
(9,9)) AS t(value, weight);
**Actual Behavior**
weighted: 3
unweighted: 5
The weighted result (3) is lower than the unweighted median (5), even though
higher values have proportionally higher weights. The result is neither correct
weighted percentile nor correct unweighted percentile.
### Expected behavior
weighted: 6 or 7
unweighted: 5
With weights (1,2,3,4,5,6,7,8,9):
- Total weight = 45
- At 0.5 quantile, rank = 22.5
- Cumulative weights: 1→1, 2→3, 3→6, 4→10, 5→15, 6→21, 7→28
- Rank 22.5 falls between value 6 (cumulative 21) and value 7 (cumulative 28)
- Expected weighted median ≈ 6-7
The weighted median should be pulled toward higher values since they have
higher weights.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]