alamb commented on code in PR #19941:
URL: https://github.com/apache/datafusion/pull/19941#discussion_r2743067144
##########
datafusion/functions-aggregate-common/src/tdigest.rs:
##########
@@ -110,7 +99,7 @@ pub struct TDigest {
centroids: Vec<Centroid>,
max_size: usize,
sum: f64,
- count: u64,
+ count: f64,
Review Comment:
> If count were a u64, it couldn't accurately represent the sum of
fractional weights, leading to inaccurate percentile calculations
Thank you @sesteves
Interestingly in ClickHouse I see the count is updated with the current sum
https://github.com/ClickHouse/ClickHouse/blob/927af1255adb37ace1b95cc3ec4316553b4cb4b4/src/AggregateFunctions/QuantileTDigest.h#L191
```c++
count = sum + l_count; // Update count, it might be different
due to += inaccuracy
```
I see now this is similar to how `TDigest::new_with_centroid` sets count to
the existing weight
https://github.com/alamb/datafusion/blob/4b35f827725ddd47f739b3731b6a3ff2a10a3c2d/datafusion/functions-aggregate-common/src/tdigest.rs#L125-L124
So it seems like `count` is actually some sort of weighted total rather than
a count which confused me
No changes needed. Thank you for the explanation
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]