adamreeve commented on code in PR #8671:
URL: https://github.com/apache/arrow-rs/pull/8671#discussion_r2446571086


##########
parquet/src/file/metadata/memory.rs:
##########
@@ -50,6 +51,16 @@ impl<T: HeapSize> HeapSize for Vec<T> {
     }
 }
 
+impl<K: HeapSize, V: HeapSize> HeapSize for HashMap<K, V> {

Review Comment:
   This is likely to be an underestimate of the HashMap heap size, as @etseidl 
mentioned in 
https://github.com/apache/arrow-rs/issues/8472#issuecomment-3413714893. 
Internally `std::collections::HashMap` uses a `hashbrown::HashMap`, which 
[holds a block of `(K, V)` 
pairs](https://github.com/rust-lang/hashbrown/blob/64aa7d5343478efa9f4bd4b0f7de56110f6da860/src/map.rs#L187),
 and `size_of::<(K, V)>()` can differ from `size_of::<K>() + size_of::<V>()` 
due to alignment padding (although the sizes do appear to match for the 
`(String, Vec<u8>)` pair used for column keys). The number of allocated 
buckets is also derived from a [load factor applied to the 
capacity](https://github.com/rust-lang/hashbrown/blob/64aa7d5343478efa9f4bd4b0f7de56110f6da860/src/raw/mod.rs#L105),
 so the capacity will underestimate the number of buckets, and the public API 
doesn't expose the internal bucket count.
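   
   As a quick illustration of the alignment point (the sizes below assume a 
64-bit target):
   
   ```rust
   use std::mem::size_of;
   
   fn main() {
       // Padding can make the tuple larger than the sum of its parts:
       assert_eq!(size_of::<u8>() + size_of::<u64>(), 9);
       assert_eq!(size_of::<(u8, u64)>(), 16); // padded to 8-byte alignment
   
       // For the column-key pair the two happen to agree (24 + 24 = 48):
       assert_eq!(size_of::<String>() + size_of::<Vec<u8>>(), 48);
       assert_eq!(size_of::<(String, Vec<u8>)>(), 48);
   }
   ```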
   
   I'm not sure how much we want to depend on internal implementation details 
of `HashMap` to improve the accuracy of this, or whether it's better to under- 
or overestimate the memory used. Maybe it would be better for this to be an 
overestimate?
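   
   For what it's worth, a rough sketch that tries to account for both the 
tuple layout and the load factor might look something like the following. It 
hard-codes hashbrown's current 7/8 maximum load factor, which is an internal 
detail that could change, and it still ignores the power-of-two bucket 
rounding and the per-bucket control bytes, so it's only an approximation:
   
   ```rust
   use std::collections::HashMap;
   use std::mem::size_of;
   
   // Assumed to match the trait in parquet/src/file/metadata/memory.rs.
   trait HeapSize {
       fn heap_size(&self) -> usize;
   }
   
   impl<K: HeapSize, V: HeapSize> HeapSize for HashMap<K, V> {
       fn heap_size(&self) -> usize {
           // hashbrown keeps the table at most 7/8 full (an internal
           // detail), so scale the reported capacity back up to
           // approximate the real bucket count.
           let approx_buckets = (self.capacity() * 8).div_ceil(7);
           // Buckets store (K, V) pairs, so use the tuple size to
           // account for any padding between K and V.
           approx_buckets * size_of::<(K, V)>()
               + self
                   .iter()
                   .map(|(k, v)| k.heap_size() + v.heap_size())
                   .sum::<usize>()
       }
   }
   ```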


