alamb commented on issue #15323:
URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2744217807

   Makes sense -- with 183 spill files, we probably would need to merge in stages.
   
   For example, starting with 183 spill files:
   1. run 10 jobs, each merging about 18 files into one (results in 10 files)
   2. run the final merge of those 10 files
   
   This results in 2x the IO (each row has to be read and written twice), but it would at least be possible to parallelize the merges in the first step.
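   For illustration, here is a minimal sketch of that two-pass merge; `SpillFile` and `merge_sorted_spills` are hypothetical stand-ins (not existing DataFusion APIs), and the first pass just uses plain threads to show where the parallelism could live:

```rust
use std::thread;

/// Placeholder for an on-disk spill file holding sorted rows.
#[derive(Debug, Clone)]
struct SpillFile {
    path: String,
}

/// Placeholder k-way merge: a real implementation would stream the sorted
/// inputs through a merge and write one sorted spill file back to disk.
fn merge_sorted_spills(inputs: &[SpillFile]) -> SpillFile {
    SpillFile {
        path: format!("merged-{}-files.spill", inputs.len()),
    }
}

/// Two-pass merge: split the spill files into groups of `fan_in`, merge each
/// group in parallel (pass 1), then merge the intermediate results (pass 2).
/// Each row is read and written twice, but pass 1 can run concurrently.
fn merge_in_stages(spills: Vec<SpillFile>, fan_in: usize) -> SpillFile {
    // Pass 1: one thread per group of `fan_in` spill files.
    let handles: Vec<_> = spills
        .chunks(fan_in)
        .map(|group| {
            let group = group.to_vec();
            thread::spawn(move || merge_sorted_spills(&group))
        })
        .collect();

    let intermediate: Vec<SpillFile> = handles
        .into_iter()
        .map(|h| h.join().expect("merge task panicked"))
        .collect();

    // Pass 2: final merge of the much smaller set of intermediate files.
    merge_sorted_spills(&intermediate)
}

fn main() {
    // 183 spill files with a fan-in of ~19 leaves about 10 intermediate
    // files for the final merge, matching the staging described above.
    let spills: Vec<SpillFile> = (0..183)
        .map(|i| SpillFile { path: format!("spill-{i}.arrow") })
        .collect();
    let result = merge_in_stages(spills, 19);
    println!("final spill file: {}", result.path);
}
```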
   
   I think @2010YOUY01 was starting to look into a SpillFileManager -- this is the kind of behavior I would imagine being part of such a thing.

