adamreeve commented on code in PR #8671:
URL: https://github.com/apache/arrow-rs/pull/8671#discussion_r2446571086
##########
parquet/src/file/metadata/memory.rs:
##########
@@ -50,6 +51,16 @@ impl<T: HeapSize> HeapSize for Vec<T> {
    }
}
+impl<K: HeapSize, V: HeapSize> HeapSize for HashMap<K, V> {
Review Comment:
This is likely to be an underestimate of the `HashMap` heap size, as @etseidl
mentioned in
https://github.com/apache/arrow-rs/issues/8472#issuecomment-3413714893.
Internally, `std::collections::HashMap` uses a `hashbrown::HashMap`, which [holds
a block of `(K, V)`
pairs](https://github.com/rust-lang/hashbrown/blob/64aa7d5343478efa9f4bd4b0f7de56110f6da860/src/map.rs#L187).
A `(K, V)` tuple can have a different size than `size_of::<K>() + size_of::<V>()`
due to alignment padding, although it looks like the sizes do match for the
`(String, Vec<u8>)` pair used for column keys. The number of allocated buckets
is also based on a [load factor applied to the
capacity](https://github.com/rust-lang/hashbrown/blob/64aa7d5343478efa9f4bd4b0f7de56110f6da860/src/raw/mod.rs#L105),
so the capacity will be an underestimate of the number of buckets, and there
isn't a way to get the internal bucket count from the public API.
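To make the alignment point concrete, here is a small standalone check (the
`(u8, u64)` pair is just an illustrative example, not a type used in the crate):

```rust
use std::mem::size_of;

fn main() {
    // A tuple can be bigger than the sum of its fields: 1 + 8 = 9 bytes
    // of fields, padded out to 16 so the tuple can be stored in an array
    // with 8-byte alignment.
    assert_eq!(size_of::<u8>() + size_of::<u64>(), 9);
    assert_eq!(size_of::<(u8, u64)>(), 16);

    // For the `(String, Vec<u8>)` pair used for column keys the sizes
    // happen to agree, since both halves are already pointer-aligned.
    assert_eq!(
        size_of::<(String, Vec<u8>)>(),
        size_of::<String>() + size_of::<Vec<u8>>()
    );
}
```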
I'm not sure how much we want to depend on internal implementation details
of `HashMap` to improve the accuracy here, or whether it's better to under- or
overestimate the memory used. Maybe it would be better for this to be an
overestimate?
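In case it helps, here's a rough sketch of what an overestimate could look
like, assuming hashbrown's current 7/8 load factor and one control byte per
bucket (both internal details that could change between releases), with
`HeapSize` reduced to the one method the sketch needs:

```rust
use std::collections::HashMap;
use std::mem::size_of;

// Stand-in for this crate's `HeapSize` trait.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl<K: HeapSize, V: HeapSize> HeapSize for HashMap<K, V> {
    fn heap_size(&self) -> usize {
        if self.capacity() == 0 {
            // An empty map has not allocated a table yet.
            return 0;
        }
        // hashbrown rounds `capacity * 8 / 7` up to a power of two to get
        // the bucket count; rounding up once more here keeps this an
        // upper bound rather than an underestimate.
        let buckets = (self.capacity() * 8 / 7 + 1).next_power_of_two();
        // Each bucket holds a `(K, V)` slot plus a one-byte control tag.
        let table = buckets * (size_of::<(K, V)>() + 1);
        // Plus the heap memory owned by the keys and values themselves.
        let entries: usize = self
            .iter()
            .map(|(k, v)| k.heap_size() + v.heap_size())
            .sum();
        table + entries
    }
}
```

The `+ 1` means the bucket estimate doubles when `capacity * 8 / 7` is already
an exact power of two; dropping it would match hashbrown's current rounding
exactly, at the cost of becoming an underestimate again if those internals
ever change.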
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]