alamb commented on issue #7858: URL: https://github.com/apache/arrow-datafusion/issues/7858#issuecomment-1769367404
Other potential "strategies" to avoid the memory overshoots might be:

# Reserve 2x the memory needed for the hash table, to account for sorting on spill

Pros: ensures the memory budget is preserved

Cons: will cause more, smaller spill files, and will cause some queries to spill even if they could have fit entirely in memory (those using more than half the budget)

# Write unsorted data to spill files

The process would look like:
1. Write (unsorted) batches to disk
2. When all batches exist, read in each spill file (maybe in 2 parts) and rewrite newly sorted files
3. Do the final merge with the sorted files

Pros: ensures we don't spill unless the memory reservation is actually exhausted

Cons: each row is now read/rewritten twice rather than just once
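The three steps of the second strategy could be sketched roughly as below. This is a minimal illustration only, not DataFusion code: the function names (`spill_unsorted`, `resort_spills`, `merge_sorted`) are hypothetical, spill files are simulated as in-memory vectors, and rows are plain `i32`s rather than Arrow batches.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Step 1: "write" each unsorted batch to its own spill file.
// (Simulated here: each Vec<i32> stands in for one file on disk.)
fn spill_unsorted(batches: Vec<Vec<i32>>) -> Vec<Vec<i32>> {
    batches
}

// Step 2: once all batches exist, read each spill file back and
// rewrite it as a newly sorted file.
fn resort_spills(spills: Vec<Vec<i32>>) -> Vec<Vec<i32>> {
    spills
        .into_iter()
        .map(|mut s| {
            s.sort();
            s
        })
        .collect()
}

// Step 3: final k-way merge of the sorted spill files using a min-heap,
// streaming one row at a time from each file.
fn merge_sorted(spills: Vec<Vec<i32>>) -> Vec<i32> {
    let mut iters: Vec<_> = spills.into_iter().map(|s| s.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some(v) = it.next() {
            heap.push(Reverse((v, i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, i))) = heap.pop() {
        out.push(v);
        if let Some(next) = iters[i].next() {
            heap.push(Reverse((next, i)));
        }
    }
    out
}

fn main() {
    let batches = vec![vec![3, 1], vec![4, 2], vec![5, 0]];
    let merged = merge_sorted(resort_spills(spill_unsorted(batches)));
    println!("{:?}", merged); // [0, 1, 2, 3, 4, 5]
}
```

Note the tradeoff the "Cons" describes: step 2 reads and rewrites every row once before step 3 reads it again for the merge, so each spilled row is touched twice instead of once.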
