alamb commented on issue #15323:
URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2744217807

Makes sense -- with 183 spill files, we would probably need to merge in stages. For example, starting with 183 spill files:

1. Run roughly 19 jobs, each merging about 10 files into one (results in ~19 intermediate files).
2. Run the final merge of those ~19 files.

This results in 2x the IO (each row has to be read and written twice), but it would at least be possible to parallelize the merges in the earlier step.

I think @2010YOUY01 was starting to look into a SpillFileManager -- this is the kind of behavior I would imagine being part of such a thing.
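A minimal sketch of the two-stage merge described above, using sorted `Vec<i32>` values to stand in for spill files. The fan-in of 10, the `merge_sorted` helper, and the use of only the standard library are illustrative assumptions for this comment, not DataFusion's actual spill-merge code.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of several already-sorted runs into one sorted run.
/// (Stand-in for merging spill files; real code would stream from disk.)
fn merge_sorted(runs: &[Vec<i32>]) -> Vec<i32> {
    // Min-heap entries are (value, run index, position within run).
    let mut heap = BinaryHeap::new();
    for (i, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, i, 0usize)));
        }
    }
    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((v, i, pos))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[i].get(pos + 1) {
            heap.push(Reverse((next, i, pos + 1)));
        }
    }
    out
}

fn main() {
    const FAN_IN: usize = 10; // hypothetical merge fan-in

    // Pretend we have 183 spill files, each individually sorted.
    let spill_files: Vec<Vec<i32>> = (0..183)
        .map(|i| (0..100).map(|j| i + j * 183).collect())
        .collect();

    // Stage 1: merge groups of up to FAN_IN files into intermediate runs.
    // Each group is independent, so these merges could run in parallel.
    let intermediate: Vec<Vec<i32>> =
        spill_files.chunks(FAN_IN).map(merge_sorted).collect();

    // Stage 2: final merge of the ~19 intermediate runs.
    let merged = merge_sorted(&intermediate);

    assert!(merged.windows(2).all(|w| w[0] <= w[1]));
    println!("merged {} rows from {} spill files", merged.len(), spill_files.len());
}
```

The 2x IO cost mentioned above shows up here as every value passing through `merge_sorted` twice: once in its stage-1 group and once in the final merge.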
Makes sense -- with 183 spill files, we probably would need to merge in stages For example starting with 183 spill files 1. run 10 jobs, each merging about 10 files into one (results in 10 files) 2. run the final merge of 10 files This results in 2x the IO (have to read / write each row twice) but it would be possible at least to parallelize the merges of the earlier step I think @2010YOUY01 was starting to look into a SpillFileManager -- this is the kind of behavior I would imagine being part of such a thing -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org