wankunde commented on PR #45661: URL: https://github.com/apache/spark/pull/45661#issuecomment-2019992484
> I think a more general approach is, `DiskBlockManager#createTempShuffleBlock` should co-locate temp shuffle files and final shuffle files (with the same shuffle id and map id). This benefits more than one spill files as well.

Spark renames the TempShuffleBlock file to the final data file when there is only one TempShuffleBlock file:

https://github.com/apache/spark/blob/0bbef2090680a7bc2d5a1d8a959ea94a6445291f/core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java#L277-L287

If there are multiple TempShuffleBlock files, Spark always reads all of the shuffle data and writes it into the final shuffle data file. In that case, the read and write work is unavoidable.
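To make the two cases concrete, here is a minimal sketch of the behavior described above. It is not Spark's code: `SpillMergeSketch` and `mergeSpills` are hypothetical names, it uses plain `java.nio.file` calls, and it concatenates whole files rather than merging partition by partition as a real shuffle writer would.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical illustration only; not part of Spark.
public class SpillMergeSketch {

    /**
     * Produces finalFile from the given spill files.
     * Returns the number of bytes that had to be copied (0 when a rename sufficed).
     */
    static long mergeSpills(List<Path> spills, Path finalFile) throws IOException {
        if (spills.isEmpty()) {
            // No spills: just create an empty output file.
            Files.write(finalFile, new byte[0]);
            return 0L;
        }
        if (spills.size() == 1) {
            // Single spill: move it into place instead of copying its contents.
            Files.move(spills.get(0), finalFile, StandardCopyOption.REPLACE_EXISTING);
            return 0L;
        }
        // Multiple spills: every byte is read and written once more.
        long copied = 0L;
        try (OutputStream out = Files.newOutputStream(finalFile)) {
            for (Path spill : spills) {
                copied += Files.copy(spill, out);
                Files.delete(spill);
            }
        }
        return copied;
    }
}
```

In this sketch the single-spill path is only cheap when the spill and the final file are on the same file store; otherwise `Files.move` falls back to copying the data, which is where co-locating the files would matter. In the multi-spill path the copy happens regardless of where the files live.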
