xudong963 commented on PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743665328
> We can only merge two statistical objects in certain special
circumstances. For example, if we have a statistical object that tracks sample
averages along with counts, we can merge two instances of them. Our
distributions are not merge-able quantities in this sense. They are _mixable_
(with a given weight), but not _merge-able_.
I had confused `merge` and `mix`. After reviewing this, my understanding is
that "merge" suggests combining datasets in a way that preserves their original
properties, whereas what is implemented is closer to a weighted mixture of
probability distributions. Do I understand correctly?
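To make the "merge-able" case from the quote concrete, here is a minimal sketch of a statistic that tracks a sample average along with its count. `MeanCount` is a hypothetical type used only for illustration; it is not part of DataFusion.

```rust
/// Hypothetical merge-able statistic: a sample mean together with its count.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MeanCount {
    mean: f64,
    count: u64,
}

impl MeanCount {
    /// Merge two instances. The counts already carry the relative weights,
    /// so no extra weight parameter is needed.
    fn merge(self, other: MeanCount) -> MeanCount {
        let count = self.count + other.count;
        if count == 0 {
            // Nothing was observed on either side.
            return MeanCount { mean: 0.0, count: 0 };
        }
        let sum = self.mean * self.count as f64 + other.mean * other.count as f64;
        MeanCount {
            mean: sum / count as f64,
            count,
        }
    }
}
```

A bare distribution carries no count, which is why combining two of them needs an explicit weight instead.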
> One of the follow-ups we previously discussed was adding a
`HistogramDistribution` object that tracks bins and ranges. These objects will
be merge-able. Therefore, we should start off by adding a
`HistogramDistribution` object first. Then, we can add a `merge` API to that
object.
Yes, I agree. `HistogramDistribution` is merge-able. Does it look like this?
```rust
pub struct HistogramDistribution {
    bins: Vec<Interval>, // The bin boundaries
    counts: Vec<u64>,    // Frequency in each bin
    total_count: u64,    // Sum of all bin counts
    range: Interval,     // Overall range covered by the histogram
}
```
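As a rough sketch of why such an object would be merge-able: when two histograms share the same bins, merging reduces to adding the per-bin counts. The `SimpleHistogram` type below is a hypothetical stand-in that uses plain `f64` edges instead of `Interval` and derives the total count rather than storing it. It also assumes identical bin edges; re-binning is out of scope here.

```rust
/// Hypothetical stand-in for a histogram; not the proposed DataFusion type.
#[derive(Debug, Clone, PartialEq)]
struct SimpleHistogram {
    /// n + 1 ascending edges describing n bins.
    edges: Vec<f64>,
    /// Frequency in each bin.
    counts: Vec<u64>,
}

impl SimpleHistogram {
    /// Sum of all bin counts.
    fn total_count(&self) -> u64 {
        self.counts.iter().sum()
    }

    /// Merge two histograms with identical bins by adding their counts.
    /// Returns `None` when the bins differ (assumption: no re-binning).
    fn merge(&self, other: &SimpleHistogram) -> Option<SimpleHistogram> {
        if self.edges != other.edges || self.counts.len() != other.counts.len() {
            return None;
        }
        let counts = self
            .counts
            .iter()
            .zip(&other.counts)
            .map(|(a, b)| a + b)
            .collect();
        Some(SimpleHistogram {
            edges: self.edges.clone(),
            counts,
        })
    }
}
```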
> If you think we should have a `mix` API for the general `Distribution`
object, we can add it too. Such an API will need to include a mixing weight in
its signature.
This is my use case:
https://github.com/apache/datafusion/pull/13296/files#diff-8d786f45bc2d5bf629754a119ed6fa7998dcff7faacd954c45945b7047b87fa1R498
The goal is to merge the file statistics across the whole file group. I'm still
working out whether a `mix` API can satisfy that requirement.
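One way to explore this: if each file's row count is known, the mixing weight could be derived from the row counts, and a weighted `mix` could be folded over the file group. The sketch below assumes a hypothetical `Moments` summary (mean and variance only) and hypothetical `mix` / `mix_file_group` helpers. None of these are existing DataFusion APIs, and real distributions may not reduce to two moments.

```rust
/// Hypothetical two-moment summary of a column's distribution.
#[derive(Debug, Clone, Copy)]
struct Moments {
    mean: f64,
    variance: f64,
}

/// Weighted mixture of two summaries; `weight_a` is in [0, 1] and the
/// weight of `b` is `1 - weight_a`.
fn mix(a: Moments, b: Moments, weight_a: f64) -> Moments {
    let weight_b = 1.0 - weight_a;
    let mean = weight_a * a.mean + weight_b * b.mean;
    // Mix the second raw moments E[X^2], then convert back to a variance.
    let second = weight_a * (a.variance + a.mean * a.mean)
        + weight_b * (b.variance + b.mean * b.mean);
    Moments {
        mean,
        variance: second - mean * mean,
    }
}

/// Fold `mix` over (summary, row_count) pairs, one pair per file.
/// Files with zero rows are skipped; returns `None` if nothing remains.
fn mix_file_group(files: &[(Moments, u64)]) -> Option<(Moments, u64)> {
    let mut iter = files.iter().copied().filter(|(_, rows)| *rows > 0);
    let first = iter.next()?;
    Some(iter.fold(first, |(acc, acc_rows), (m, rows)| {
        let total = acc_rows + rows;
        // Weight of the accumulated summary is its share of the rows so far.
        let weight = acc_rows as f64 / total as f64;
        (mix(acc, m, weight), total)
    }))
}
```

Carrying the running row count alongside the summary is what makes the fold well defined; without it, the weight for each step would have to be supplied by the caller.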