ozankabak commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743724228
> I confused the merge and mix, after reviewing the information, "Merge" suggests combining datasets that maintain their original properties, but what's implemented is actually close to a weighted mixture of probability distributions. Do I understand correctly?

Right -- `merge` coalesces partial information about a single quantity, while `mix` models a probabilistic selection between two quantities. Your use case seems to fall into the first category. Use cases for mixtures arise when modeling things like filters that depend on composite expressions involving random functions.

> Yes, I agree. HistogramDistribution is merge-able. Does it look like this?
>
> ```rust
> pub struct HistogramDistribution {
>     bins: Vec<Interval>, // The bin boundaries
>     counts: Vec<u64>,    // Frequency in each bin
>     total_count: u64,    // Sum of all bin counts
>     range: Interval,     // Overall range covered by the histogram
> }
> ```

I haven't thought about it in detail, but this seems reasonable. We'd probably want an attribute specifying the maximum number of bins one can have, because many operations (including `merge`) tend to increase the number of bins unless special care is taken to coalesce them when necessary.

The attribute `total_count` is derivable from `counts`, so we may not want to store it, for normalization/consistency reasons. The same goes for `range`: it can be constructed from `bins` in O(1) time.
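
To make the suggestions above concrete, here is a minimal, self-contained sketch. It is not DataFusion code: the `Histogram` type, its `edges: Vec<f64>` field (standing in for `bins: Vec<Interval>`), the `max_bins` attribute, and the coalescing rule (fold together the adjacent pair of bins with the smallest combined count) are all assumptions chosen for illustration. It shows `total_count` and `range` computed on demand rather than stored, and a bin cap enforced by coalescing.

```rust
/// Hypothetical histogram sketch; plain `f64` edges stand in for `Vec<Interval>`.
#[derive(Debug, Clone, PartialEq)]
pub struct Histogram {
    /// Strictly increasing bin boundaries; bin `i` spans `edges[i]..edges[i + 1]`.
    edges: Vec<f64>,
    /// Frequency in each bin; always `edges.len() - 1` entries.
    counts: Vec<u64>,
    /// Assumed attribute: upper limit on the number of bins.
    max_bins: usize,
}

impl Histogram {
    /// Validates the inputs, then coalesces down to `max_bins` if needed.
    pub fn new(edges: Vec<f64>, counts: Vec<u64>, max_bins: usize) -> Option<Self> {
        let sorted = edges.windows(2).all(|w| w[0] < w[1]);
        if edges.len() < 2 || counts.len() + 1 != edges.len() || !sorted || max_bins == 0 {
            return None;
        }
        let mut hist = Self { edges, counts, max_bins };
        hist.coalesce();
        Some(hist)
    }

    /// Derived from `counts` instead of being stored, so it cannot go stale.
    pub fn total_count(&self) -> u64 {
        self.counts.iter().sum()
    }

    /// Derived from the first and last edge in O(1) time.
    pub fn range(&self) -> (f64, f64) {
        (self.edges[0], self.edges[self.edges.len() - 1])
    }

    pub fn num_bins(&self) -> usize {
        self.counts.len()
    }

    /// Assumed policy: repeatedly fold together the adjacent pair of bins
    /// with the smallest combined count until the cap is respected.
    fn coalesce(&mut self) {
        while self.counts.len() > self.max_bins {
            let i = (0..self.counts.len() - 1)
                .min_by_key(|&i| self.counts[i] + self.counts[i + 1])
                .expect("more than one bin remains here");
            self.counts[i] += self.counts[i + 1];
            self.counts.remove(i + 1);
            // Dropping the shared boundary merges bins `i` and `i + 1`.
            self.edges.remove(i + 1);
        }
    }
}
```

For example, building a histogram with edges `[0, 1, 2, 3, 4]`, counts `[5, 1, 2, 10]`, and `max_bins = 3` folds the two middle bins together, leaving three bins whose total count is still 18 and whose range is still `(0.0, 4.0)`. A real `merge` would call the same coalescing step after combining two histograms; how counts are redistributed across differing bin boundaries is left open here.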