Re: [PR] Implement intermediate result blocked approach to aggregation memory management [datafusion]

via GitHub Mon, 19 May 2025 04:57:55 -0700


Rachelint commented on PR #15591:
URL: https://github.com/apache/datafusion/pull/15591#issuecomment-2890735910


   >I wonder what happens if we make it more like at least 1 million or 1MiB so 
the effect on cache-friendliness is smaller?
   We could optimize a growing strategy for the first allocated Vec if memory 
usage / overhead of first block is a concern.
    
   > I think we should try to minimize the impact of this on low-cardinality 
cases (e.g. make sure they fit in one array, minimize the overhead of blocks)...
   
   If I understand correctly, does that mean a strategy like this:
   - We make the block size large enough
   - For the first block, we still grow it by `resizing` at first
   - But once it has grown large enough, we switch to the `blocked approach`?
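
   A rough sketch of the strategy in the list above, only to make the question concrete. `HybridBlocks` and its methods are hypothetical names for illustration and are not the API in this PR; the element type `u64` is also just an assumption:

   ```rust
   /// Hypothetical container (not the PR's API): values live in blocks of at
   /// most `block_size` elements.
   struct HybridBlocks {
       block_size: usize,
       blocks: Vec<Vec<u64>>,
   }

   impl HybridBlocks {
       fn new(block_size: usize) -> Self {
           assert!(block_size > 0, "block_size must be non-zero");
           // The first block starts empty with no reserved capacity, so
           // low-cardinality inputs only pay for what they actually use.
           Self { block_size, blocks: vec![Vec::new()] }
       }

       fn push(&mut self, value: u64) {
           let block_size = self.block_size;
           let last_is_full = self
               .blocks
               .last()
               .map_or(true, |b| b.len() == block_size);
           if last_is_full {
               // Blocked phase: every later block is allocated once at full
               // capacity and is never resized or copied.
               self.blocks.push(Vec::with_capacity(block_size));
           }
           // Resizing phase: pushes into the first block rely on Vec's
           // amortized growth. Its capacity may overshoot `block_size`,
           // but its length never does.
           self.blocks.last_mut().unwrap().push(value);
       }

       fn get(&self, index: usize) -> Option<u64> {
           let block = self.blocks.get(index / self.block_size)?;
           block.get(index % self.block_size).copied()
       }

       fn num_blocks(&self) -> usize {
           self.blocks.len()
       }

       fn len(&self) -> usize {
           self.blocks.iter().map(Vec::len).sum()
       }
   }
   ```

   With a large `block_size`, a low-cardinality aggregation would stay entirely inside the first, resizable block, and only high-cardinality inputs would ever allocate a second block.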
   
   > Yeah it is quite efficient, although problematic for large inputs
   
   Agreed. It also leads to high memory usage, because memory is only released 
after all the batches have been returned (we currently hold one single large 
batch and only return slices of it).
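
   To illustrate the memory point, here is a minimal sketch using plain `Arc<Vec<u64>>` as a stand-in for a reference-counted buffer. `Slice`, `emit_slices` and `emit_blocks` are hypothetical helpers, not DataFusion or Arrow APIs:

   ```rust
   use std::sync::{Arc, Weak};

   /// Hypothetical zero-copy slice: it keeps the whole backing buffer alive.
   struct Slice {
       buffer: Arc<Vec<u64>>,
       offset: usize,
       len: usize,
   }

   impl Slice {
       fn values(&self) -> &[u64] {
           &self.buffer[self.offset..self.offset + self.len]
       }
   }

   /// Single-batch approach: every output is a slice of one big buffer.
   /// The returned `Weak` is only a probe to observe when the buffer is freed.
   fn emit_slices(data: Vec<u64>, batch_size: usize) -> (Vec<Slice>, Weak<Vec<u64>>) {
       assert!(batch_size > 0, "batch_size must be non-zero");
       let total = data.len();
       let buffer = Arc::new(data);
       let probe = Arc::downgrade(&buffer);
       let slices = (0..total)
           .step_by(batch_size)
           .map(|offset| Slice {
               buffer: Arc::clone(&buffer),
               offset,
               len: batch_size.min(total - offset),
           })
           .collect();
       (slices, probe)
   }

   /// Blocked approach: each output owns its own allocation, so dropping
   /// one block frees that block's memory immediately.
   fn emit_blocks(data: Vec<u64>, batch_size: usize) -> Vec<Arc<Vec<u64>>> {
       assert!(batch_size > 0, "batch_size must be non-zero");
       data.chunks(batch_size)
           .map(|chunk| Arc::new(chunk.to_vec()))
           .collect()
   }
   ```

   In the sliced case the buffer stays allocated until the last slice is dropped, while in the blocked case each block can be freed as soon as its consumer is done with it.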


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

