alamb opened a new issue, #4645:
URL: https://github.com/apache/arrow-rs/issues/4645
**Describe the bug**
When merging GBs of high cardinality dictionary data, the size reported by
the RowInterner is a significant (GBs) undercount.
This leads our system to significantly exceed its configured memory limit
in several cases.
I believe the bug is that `Bucket::size()` does not account for the size of
the `Bucket` embedded in each `Slot`. I will make a PR shortly.
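For context, here is a minimal, self-contained sketch of the accounting problem; the `Slot`/`Bucket` fields below are illustrative assumptions, not the actual arrow-row definitions. A shallow `size()` that only counts the slot array misses the memory held by buckets nested inside the slots, whereas a recursive version includes it:
```
// Illustrative types only -- not the real arrow-row interner structs
struct Slot {
    value: Vec<u8>,
    // assumption for the sketch: a slot may own a nested bucket
    child: Option<Box<Bucket>>,
}

struct Bucket {
    slots: Vec<Slot>,
}

impl Bucket {
    /// Shallow accounting: counts only this bucket and its slot array,
    /// missing the memory held by slot payloads and child buckets
    fn size_shallow(&self) -> usize {
        std::mem::size_of::<Self>() + self.slots.capacity() * std::mem::size_of::<Slot>()
    }

    /// Deep accounting: also counts slot payloads and recurses into
    /// any child buckets owned by the slots
    fn size(&self) -> usize {
        self.size_shallow()
            + self
                .slots
                .iter()
                .map(|s| s.value.capacity() + s.child.as_ref().map_or(0, |c| c.size()))
                .sum::<usize>()
    }
}
```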
**To Reproduce**
I can reproduce this when merging GBs of high cardinality, proprietary
dictionary-encoded data.
I tried to write a unit test but could not figure out how; any thoughts
would be appreciated:
```
#[test]
fn test_intern_sizes() {
    let mut interner = OrderPreservingInterner::default();

    // Intern 3000 values, each 8 bytes large; the interner should
    // report at least the space needed to store them
    let num_items = 3000;
    let mut values: Vec<usize> = (0..num_items).collect();
    values.reverse();
    interner.intern(values.iter().map(|v| Some(v.to_be_bytes())));

    let actual_size = interner.size();
    let min_expected_size =
        // at least space for each item
        num_items * std::mem::size_of::<usize>()
        // at least one slot for each item
        + num_items * std::mem::size_of::<Slot>();

    println!("Actual size: {actual_size}, min {min_expected_size}");
    assert!(
        actual_size > min_expected_size,
        "actual_size {actual_size} not larger than min_expected_size: {min_expected_size}"
    );
}
```
**Expected behavior**
The size reported by the interner should account for all the memory it uses,
including the `Bucket`s nested within each `Slot`.
**Additional context**
I found this while testing
https://github.com/apache/arrow-datafusion/pull/7130 with our internal data --
it did not reduce memory requirements the way I expected. I tracked the root
cause down to this undercount.