marvinlanhenke commented on issue #8764: URL: https://github.com/apache/datafusion/issues/8764#issuecomment-2141501909
@alamb @yyy1001 I've taken a look into this and I can agree with the comments @yyy1001 made on the original PR [here](https://github.com/apache/datafusion/pull/8779#issuecomment-1883677052) and [here](https://github.com/apache/datafusion/pull/8779#issuecomment-1913656924). I think the main problem here is the point of time when the collection (HashSet/HashTable) is created and the size estimation is done or needed. In `count_distinct/native.rs` we already have access to the collection to do the estimation, whereas in `hash_join.rs` the estimation is done prior to the creation. This also leads to differences regarding the use of `std::mem::size_of` vs `std::mem::size_of_val()` In order to make it reusable we need to introduce some params; and unfortunately cannot go full generic (?): Here is an idea / outline: ```Rust fn estimate_memory_size<T>(num_elements: usize, fixed_size: usize) -> usize { let estimated_buckets = (num_elements.checked_mul(8).unwrap_or(usize::MAX) / 7).next_power_of_two(); // fixed size part of memory allocation // 1 byte per number of bucket and the size of the // collection (HashSet, HashTable); if used before collection is created // we can only estimate the fixed size part of the collection. let fix = fixed_size + estimated_buckets; // variable size part of memory allocation // size of entry type multiplied with the number of estimated buckets. let var = std::mem::size_of::<T>() * estimated_buckets; fix + var } ``` Example usage in `PrimitiveDistinctCountAccumulator` - `fn size()` ```Rust fn size(&self) -> usize { let fixed_size = std::mem::size_of_val(self) + std::mem::size_of_val(&self.values); estimate_memory_size::<T::Native>(self.values.len(), fixed_size) } ``` Example usage in `hash_join.rs`: ```Rust let fixed_size = std::mem::size_of::<JoinHashMap>(); let estimated = estimate_memory_size::<(u64,u64)>(num_rows, fixed_size); ``` @alamb let me know what you think. Although, this might not be the best abstraction (since we need to provide the fixed_size) it might still have its benefits - in terms of consistency, maintainabiltiy, and testability? If you think its worth it, I'd proceed with implementing it in `datafusion_common` (anywhere specific here?). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
