marvinlanhenke commented on issue #8764:
URL: https://github.com/apache/datafusion/issues/8764#issuecomment-2143467627

   ...while working on this I noticed another issue in the current impl 
[/count_distinct/native.rs](https://github.com/apache/datafusion/blob/main/datafusion/physical-expr/src/aggregate/count_distinct/native.rs#L117-L129)
   
   Although we perform a checked_mul when estimating buckets; we unwrap with 
usize::MAX which leads to a high number of estimated buckets. Later we perform 
another multiplication with the size of the entry type which can also overflow.
   
   I think returning and error and changing the signature to `Result<usize>` 
instead of unwrapping would be the cleanest solution? However this requires a 
lot of changes for the `trait Accumulator` and all its implementors;
   
   I'm not sure we want to do this change for an edge-case like this? The other 
option would be to simply panic?
   
   @alamb @yyy1000 WDYT?
   
   ```Rust
       fn size(&self) -> usize {
           let estimated_buckets = 
(self.values.len().checked_mul(8).unwrap_or(usize::MAX)
               / 7)
           .next_power_of_two();
   
           // This can still overflow
           std::mem::size_of_val(self)
               + std::mem::size_of::<T::Native>() * estimated_buckets
               + estimated_buckets
               + std::mem::size_of_val(&self.values)
       }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to