alamb commented on PR #15591: URL: https://github.com/apache/datafusion/pull/15591#issuecomment-2890562106
> BTW, I am confused about why there are so many `page_fault`s in the `blocked accumulate`.

Me too -- I looked at the flamegraph you provided and I agree that almost half of the allocation time seems to be spent in page faults / zeroing memory. However, I can't tell whether that is slowness in the underlying uninitialized `Vec` or something else going on.

> * But it may not really help performance much currently (the performance improvement is mainly due to removing the expensive `slice`).

Yes, that was my understanding -- blocked aggregation would only help performance when the number of intermediate groups is large (which forces additional memory allocations).

> But inspired by the `batch_size`-based memory allocation, I am wondering whether we can have some way to reuse memory. And I am trying it today.

I suspect you already know this, but I think you can get the original `Vec` back from an array (see the sketch below) via:

1. `PrimitiveArray::into_parts()` --> get a `ScalarBuffer`
2. `ScalarBuffer::into_inner()` --> get a `Buffer`
3. [`Buffer::into_vec()`](https://docs.rs/arrow/latest/arrow/buffer/struct.Buffer.html#method.into_vec) --> get a `Vec`

However, in the high-cardinality case, I am not sure there are buffers to reuse during aggregation (the buffers are all held until the output is needed, and once the output is produced they don't get re-created).
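Not part of the original comment: a minimal sketch of that reclamation path, assuming a recent `arrow-rs` where these APIs are available. The important detail is that `Buffer::into_vec` is fallible -- reuse only works when the buffer uniquely owns its allocation:

```rust
use arrow::array::Int64Array;

fn main() {
    // Build an array whose backing allocation we would like to reclaim.
    let array = Int64Array::from(vec![1_i64, 2, 3]);

    // 1. PrimitiveArray::into_parts() --> (DataType, ScalarBuffer<i64>, Option<NullBuffer>)
    let (_data_type, scalar_buffer, _nulls) = array.into_parts();

    // 2. ScalarBuffer::into_inner() --> Buffer
    let buffer = scalar_buffer.into_inner();

    // 3. Buffer::into_vec::<i64>() succeeds only when the buffer uniquely owns
    //    its allocation (no other references, and it originated from a Vec);
    //    otherwise the Buffer is handed back in the Err variant.
    match buffer.into_vec::<i64>() {
        Ok(mut vec) => {
            // The allocation is ours again: it can be cleared and reused
            // for the next batch instead of allocating fresh memory.
            vec.clear();
            assert!(vec.capacity() >= 3);
        }
        Err(_shared_buffer) => {
            // The buffer was shared (e.g. the array was sliced or cloned),
            // so a fresh allocation is unavoidable in this case.
        }
    }
}
```

This also illustrates the caveat in the last paragraph: the reuse path only pays off if the accumulator actually gets sole ownership of its buffers back after emitting output.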