fmonjalet commented on code in PR #16816:
URL: https://github.com/apache/datafusion/pull/16816#discussion_r2216291759
##########
datafusion/functions-aggregate/src/array_agg.rs:
##########
@@ -1008,8 +1002,7 @@ mod tests {
acc2.update_batch(&[data(["b", "c", "a"])])?;
acc1 = merge(acc1, acc2)?;
- // without compaction, the size is 2652.
- assert_eq!(acc1.size(), 732);
+ assert_eq!(acc1.size(), 266);
Review Comment:
Nice! To convince myself this approach was still accurate enough, I compared this size to the raw data size, which I got from:
```rust
let data1 = data(["a", "c", "b"]);
let data2 = data(["b", "c", "a"]);
println!("Size of data: {} {}", data1.get_array_memory_size(),
data2.get_array_memory_size());
```
I see that the data size is 1208 for each. _But_ that is mostly due to excess capacity in the Array, which I would hope does not happen as much for larger arrays (the ones that actually matter for memory accounting). I would also hope that unused memory pages are never actually backed by the kernel, so they don't consume memory in practice (though I'm unsure whether that applies here).
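As an aside, here is a minimal sketch (not code from this PR; it uses a deliberately over-allocated `StringBuilder` and assumes an arrow-rs version that provides `Array::shrink_to_fit`) of how builder over-allocation can inflate `get_array_memory_size` well beyond the bytes actually stored:
```rust
use arrow::array::{Array, StringBuilder};

fn main() {
    // Reserve far more capacity than the three short strings we append;
    // the finished array keeps those buffers, and get_array_memory_size
    // reports the allocated capacity, not just the bytes in use.
    let mut builder = StringBuilder::with_capacity(1024, 8192);
    for v in ["a", "c", "b"] {
        builder.append_value(v);
    }
    let mut array = builder.finish();
    println!("with excess capacity: {}", array.get_array_memory_size());

    // Dropping the unused capacity brings the reported size back down to
    // roughly the size of the data itself.
    array.shrink_to_fit();
    println!("after shrink_to_fit:  {}", array.get_array_memory_size());
}
```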
If we update the `data` helper to use `shrink_to_fit`:
```rust
fn data<T, const N: usize>(list: [T; N]) -> ArrayRef
where
    ScalarValue: From<T>,
{
    let values: Vec<_> = list.into_iter().map(ScalarValue::from).collect();
    let mut array =
        ScalarValue::iter_to_array(values).expect("Cannot convert to array");
    array.shrink_to_fit();
    array
}
```
Then we get 139 bytes per array, which is 278 in total, so we are under-accounting by only 12 bytes in practice, which sounds good.