ozankabak commented on PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743724228
> I confused the merge and mix, after reviewing the information, "Merge"
> suggests combining datasets that maintain their original properties, but what's
> implemented is actually close to a weighted mixture of probability
> distributions. Do I understand correctly?
Right -- `merge` coalesces partial information about a single quantity,
while `mix` models a probabilistic selection between two quantities. Your use
case seems to fall in the first category. Use cases for mixtures arise when
modeling things like filters that depend on composite expressions involving
random functions, etc.
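To make the distinction concrete, here is a minimal sketch. It is not the DataFusion API: `Summary`, `Moments`, and the free function `mix` are hypothetical names, and plain `f64` values stand in for real column types. Under those assumptions, `merge` combines two partial summaries of the same quantity, while `mix` computes the mean and variance of a random pick between two different quantities.

```rust
/// Hypothetical partial summary of one quantity (e.g. one column, per file).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Summary {
    count: u64,
    sum: f64,
    min: f64,
    max: f64,
}

impl Summary {
    /// "Merge": both summaries describe the SAME quantity, so their
    /// information is pooled (counts and sums add, bounds widen).
    fn merge(&self, other: &Summary) -> Summary {
        Summary {
            count: self.count + other.count,
            sum: self.sum + other.sum,
            min: self.min.min(other.min),
            max: self.max.max(other.max),
        }
    }
}

/// Hypothetical first two moments of a distribution.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Moments {
    mean: f64,
    variance: f64,
}

/// "Mix": the result is drawn from `a` with probability `w` and from `b`
/// with probability `1 - w`. Assumes `0.0 <= w <= 1.0`.
fn mix(a: Moments, b: Moments, w: f64) -> Moments {
    let mean = w * a.mean + (1.0 - w) * b.mean;
    // Second raw moment of the mixture: weighted sum of E[X^2] of each part.
    let second = w * (a.variance + a.mean * a.mean)
        + (1.0 - w) * (b.variance + b.mean * b.mean);
    Moments {
        mean,
        variance: second - mean * mean,
    }
}
```

Note that mixing can increase the variance beyond that of either input when the means differ, whereas merging never invents spread that was not observed in one of the parts.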
> Yes, I agree. HistogramDistribution is merge-able. Does it look like this?
```rust
pub struct HistogramDistribution {
    bins: Vec<Interval>, // The bin boundaries
    counts: Vec<u64>,    // Frequency in each bin
    total_count: u64,    // Sum of all bin counts
    range: Interval,     // Overall range covered by the histogram
}
```
I haven't thought about it in detail, but this seems reasonable. We'd
probably want an attribute specifying the maximum number of bins one can have,
because many operations (including `merge`) will tend to increase the number
of bins unless special care is taken to coalesce when necessary. The
`total_count` attribute is derivable from `counts`, so we may not want to store
it for normalization/consistency reasons. The same goes for `range`: it can be
constructed from `bins` in O(1) time.
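As a rough sketch of these suggestions (not a proposed implementation), the struct could drop the derived fields and carry a bin cap instead. Everything here is hypothetical: `Bin` is a stand-in for `Interval` with plain `f64` bounds, `HistogramSketch`, `max_bins`, and `coalesce` are illustrative names, and the coalescing policy (fuse the adjacent pair with the smallest combined count) is just one possible choice. The sketch assumes bins are sorted and contiguous, and that `bins` and `counts` have equal length.

```rust
/// Hypothetical stand-in for `Interval`, using plain `f64` bounds.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bin {
    lower: f64,
    upper: f64,
}

/// Histogram sketch that stores no derived state.
#[derive(Debug, Clone, PartialEq)]
struct HistogramSketch {
    bins: Vec<Bin>,   // Sorted, contiguous bins (assumption)
    counts: Vec<u64>, // Frequency in each bin, same length as `bins`
    max_bins: usize,  // Upper limit on the number of bins
}

impl HistogramSketch {
    /// Derived on demand, so it cannot drift out of sync with `counts`.
    fn total_count(&self) -> u64 {
        self.counts.iter().sum()
    }

    /// Overall range in O(1): lower bound of the first bin and upper bound
    /// of the last one. Returns `None` for an empty histogram.
    fn range(&self) -> Option<Bin> {
        let first = self.bins.first()?;
        let last = self.bins.last()?;
        Some(Bin { lower: first.lower, upper: last.upper })
    }

    /// Fuses adjacent bins until at most `max_bins` (at least 1) remain.
    /// Each step fuses the adjacent pair with the smallest combined count.
    fn coalesce(&mut self) {
        let cap = self.max_bins.max(1);
        while self.bins.len() > cap {
            // Index of the left bin of the cheapest adjacent pair.
            let mut best = 0;
            for i in 1..self.bins.len() - 1 {
                let candidate = self.counts[i] + self.counts[i + 1];
                if candidate < self.counts[best] + self.counts[best + 1] {
                    best = i;
                }
            }
            // Extend the left bin over the right one and pool their counts.
            self.bins[best].upper = self.bins[best + 1].upper;
            self.counts[best] += self.counts[best + 1];
            self.bins.remove(best + 1);
            self.counts.remove(best + 1);
        }
    }
}
```

An operation such as `merge` could call `coalesce` on its result so that the bin count stays bounded; the total count and the overall range are unchanged by coalescing.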
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]