fmonjalet commented on code in PR #16816:
URL: https://github.com/apache/datafusion/pull/16816#discussion_r2216291759
##########
datafusion/functions-aggregate/src/array_agg.rs:
##########
@@ -1008,8 +1002,7 @@ mod tests {
acc2.update_batch(&[data(["b", "c", "a"])])?;
acc1 = merge(acc1, acc2)?;
- // without compaction, the size is 2652.
- assert_eq!(acc1.size(), 732);
+ assert_eq!(acc1.size(), 266);
Review Comment:
Nice! To convince myself this approach was still accurate enough, I compared this size to the raw data size, which I got from:
```rust
let data1 = data(["a", "c", "b"]);
let data2 = data(["b", "c", "a"]);
println!("Size of data: {} {}", data1.get_array_memory_size(),
data2.get_array_memory_size());
```
I see that the data size is 1208 for each. _But_ that is mostly due to excess capacity in the Array, which I would hope does not happen as much for larger arrays (the ones that actually matter for memory accounting). I would also hope that unused memory pages are never actually backed by the kernel, so they don't consume memory in practice (though I'm unsure whether that applies here).
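As an aside, here is a minimal sketch (not code from this PR; it uses a deliberately over-allocated `StringBuilder` and assumes an arrow-rs version that provides `Array::shrink_to_fit`) of how builder over-allocation can inflate `get_array_memory_size` well beyond the bytes actually stored:
```rust
use arrow::array::{Array, StringBuilder};

fn main() {
    // Reserve far more capacity than the three short strings we append;
    // the finished array keeps those buffers, and get_array_memory_size
    // reports the allocated capacity, not just the bytes in use.
    let mut builder = StringBuilder::with_capacity(1024, 8192);
    for v in ["a", "c", "b"] {
        builder.append_value(v);
    }
    let mut array = builder.finish();
    println!("with excess capacity: {}", array.get_array_memory_size());

    // Dropping the unused capacity brings the reported size back down to
    // roughly the size of the data itself.
    array.shrink_to_fit();
    println!("after shrink_to_fit:  {}", array.get_array_memory_size());
}
```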
If we update the `data` helper to use `shrink_to_fit`:
```rust
fn data<T, const N: usize>(list: [T; N]) -> ArrayRef
where
    ScalarValue: From<T>,
{
    let values: Vec<_> = list.into_iter().map(ScalarValue::from).collect();
    let mut array =
        ScalarValue::iter_to_array(values).expect("Cannot convert to array");
    array.shrink_to_fit();
    array
}
```
Then we get 139 bytes per array, which is 278 in total, so we are under-accounting by only 12 bytes in practice, which sounds good.