Re: [PR] feat: support merge for `Distribution` [datafusion]

via GitHub Fri, 21 Mar 2025 08:30:39 -0700


ozankabak commented on PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743724228


   > I confused the merge and mix, after reviewing the information, "Merge" 
suggests combining datasets that maintain their original properties, but what's 
implemented is actually close to a weighted mixture of probability 
distributions. Do I understand correctly?
   
   Right -- `merge` coalesces partial information about a single quantity, 
while `mix` models a probabilistic selection between two quantities. Your use 
case seems to fall into the first category. Use cases for `mix` arise when 
modeling things like filters that depend on composite expressions involving 
random functions.
   
   > Yes, I agree. HistogramDistribution is merge-able. Does it look like this?
   ```rust
   pub struct HistogramDistribution {
       bins: Vec<Interval>,     // The bin boundaries
       counts: Vec<u64>,        // Frequency in each bin
       total_count: u64,        // Sum of all bin counts
       range: Interval,         // Overall range covered by the histogram
   }
   ```
   
   I haven't thought about it in detail, but this seems reasonable. We'd 
probably want an attribute specifying the maximum number of bins one can have, 
because many operations (including `merge`) will tend to increase the number 
of bins unless special care is taken to coalesce when necessary. The attribute 
`total_count` is derivable from `counts`, so we may not want to store it, for 
normalization/consistency reasons. The same goes for `range`: it can be 
constructed from `bins` in O(1) time.
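
   To make the suggestion concrete, here is a minimal sketch, not a proposed
   API. It assumes plain `f64` bin edges instead of `Interval`, and the names
   `Histogram`, `max_bins`, and `coalesce` are hypothetical. It stores only the
   edges and counts, derives `total_count` and `range` on demand, and caps the
   number of bins by fusing the adjacent pair with the smallest combined count
   (one possible coalescing policy among many). `merge` itself is not shown.

   ```rust
   /// Illustrative stand-in for the histogram discussed above.
   /// Assumption: bins are described by sorted `f64` edges, not `Interval`s.
   #[derive(Debug, Clone, PartialEq)]
   pub struct Histogram {
       /// Sorted bin edges; bin `i` covers `[edges[i], edges[i + 1])`.
       edges: Vec<f64>,
       /// Frequency in each bin; `counts.len() == edges.len() - 1`.
       counts: Vec<u64>,
       /// Upper bound on the number of bins (hypothetical attribute).
       max_bins: usize,
   }

   impl Histogram {
       pub fn new(edges: Vec<f64>, counts: Vec<u64>, max_bins: usize) -> Self {
           assert!(max_bins >= 1, "need at least one bin");
           assert!(!counts.is_empty(), "need at least one bin");
           assert_eq!(edges.len(), counts.len() + 1);
           assert!(edges.windows(2).all(|w| w[0] < w[1]), "edges must be sorted");
           let mut h = Self { edges, counts, max_bins };
           h.coalesce();
           h
       }

       /// Derived from `counts` rather than stored, so it cannot go stale.
       pub fn total_count(&self) -> u64 {
           self.counts.iter().sum()
       }

       /// Derived from the first and last edge in O(1) time.
       pub fn range(&self) -> (f64, f64) {
           (self.edges[0], self.edges[self.edges.len() - 1])
       }

       pub fn num_bins(&self) -> usize {
           self.counts.len()
       }

       /// Fuse adjacent bins until at most `max_bins` remain, always picking
       /// the adjacent pair with the smallest combined count.
       fn coalesce(&mut self) {
           while self.counts.len() > self.max_bins {
               let i = (0..self.counts.len() - 1)
                   .min_by_key(|&i| self.counts[i] + self.counts[i + 1])
                   .expect("more than one bin remains");
               self.counts[i] += self.counts[i + 1];
               self.counts.remove(i + 1);
               // Dropping the shared edge fuses bins `i` and `i + 1`.
               self.edges.remove(i + 1);
           }
       }
   }
   ```

   An operation that can add bins would call `coalesce` before returning, so
   the cap holds after every step and the total count is preserved.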


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

