alamb commented on issue #7858: URL: https://github.com/apache/arrow-datafusion/issues/7858#issuecomment-1769367404
Other potential "strategies" to avoid the memory overshoots might be:

# Reserve 2x the memory needed for the hash table, to account for sorting on spill

Pros: ensures the memory budget is preserved

Cons: will cause more, smaller spill files, and will cause some queries to spill even if they could have fit entirely in memory (those using more than half the budget)

# Write unsorted data to spill files

The process would look like:
1. Write (unsorted) batches to disk
2. When all batches exist, read in each spill file (maybe in 2 parts) and rewrite newly sorted files
3. Do the final merge with the sorted files

Pros: ensures we don't spill unless the memory reservation is actually exhausted

Cons: each row is now read/rewritten twice rather than just once
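The three steps of the second strategy could be sketched roughly as below. This is a minimal illustration only, not DataFusion code: the function names (`spill_unsorted`, `resort_spills`, `merge_sorted`) are hypothetical, spill files are simulated as in-memory vectors, and rows are plain `i32`s rather than Arrow batches.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Step 1: "write" each unsorted batch to its own spill file.
// (Simulated here: each Vec<i32> stands in for one file on disk.)
fn spill_unsorted(batches: Vec<Vec<i32>>) -> Vec<Vec<i32>> {
    batches
}

// Step 2: once all batches exist, read each spill file back and
// rewrite it as a newly sorted file.
fn resort_spills(spills: Vec<Vec<i32>>) -> Vec<Vec<i32>> {
    spills
        .into_iter()
        .map(|mut s| {
            s.sort();
            s
        })
        .collect()
}

// Step 3: final k-way merge of the sorted spill files using a min-heap,
// streaming one row at a time from each file.
fn merge_sorted(spills: Vec<Vec<i32>>) -> Vec<i32> {
    let mut iters: Vec<_> = spills.into_iter().map(|s| s.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some(v) = it.next() {
            heap.push(Reverse((v, i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, i))) = heap.pop() {
        out.push(v);
        if let Some(next) = iters[i].next() {
            heap.push(Reverse((next, i)));
        }
    }
    out
}

fn main() {
    let batches = vec![vec![3, 1], vec![4, 2], vec![5, 0]];
    let merged = merge_sorted(resort_spills(spill_unsorted(batches)));
    println!("{:?}", merged); // [0, 1, 2, 3, 4, 5]
}
```

Note the tradeoff the "Cons" describes: step 2 reads and rewrites every row once before step 3 reads it again for the merge, so each spilled row is touched twice instead of once.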
