Re: [PR] [SPARK-47518][CORE] Skip transfer the last spilled shuffle data [spark]

via GitHub Tue, 26 Mar 2024 02:58:56 -0700


wankunde commented on PR #45661:
URL: https://github.com/apache/spark/pull/45661#issuecomment-2019992484


   > I think a more general approach is, 
`DiskBlockManager#createTempShuffleBlock` should co-locate temp shuffle files 
and final shuffle files (with the same shuffle id and map id). This benefits 
more than one spill files as well.
   
   Spark will rename the TempShuffleBlock file to the final data file when 
there is only one TempShuffleBlock file.
   
   
https://github.com/apache/spark/blob/0bbef2090680a7bc2d5a1d8a959ea94a6445291f/core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java#L277-L287
   
   If there are multiple TempShuffleBlock files, Spark always reads all of the 
shuffle data and writes it into the final shuffle data file.
   In that case, the read and write work is unavoidable.
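
   To make the two cases concrete, here is a minimal, hypothetical sketch (not 
the actual UnsafeShuffleWriter code). The class name `SpillMergeSketch` and the 
method `mergeSpills` are made up for illustration, and the sketch simply 
concatenates whole spill files, ignoring the per-partition layout that the real 
merge has to preserve. It only shows why a single spill costs no copy while 
multiple spills must be read and rewritten:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical illustration only; names and structure are assumptions,
// not Spark's real API.
public class SpillMergeSketch {

    /** Produces the output file and returns how many bytes had to be copied. */
    static long mergeSpills(List<Path> spills, Path output) throws IOException {
        if (spills.isEmpty()) {
            // No spills: create an empty output file.
            Files.write(output, new byte[0]);
            return 0L;
        }
        if (spills.size() == 1) {
            // Single spill: a rename is enough, no data is read or rewritten.
            Files.move(spills.get(0), output, StandardCopyOption.REPLACE_EXISTING);
            return 0L;
        }
        // Multiple spills: every byte is read and written once more.
        // (Simplification: the real merge interleaves spills per partition.)
        long copied = 0L;
        try (OutputStream out = Files.newOutputStream(output)) {
            for (Path spill : spills) {
                try (InputStream in = Files.newInputStream(spill)) {
                    copied += in.transferTo(out);
                }
                Files.delete(spill);
            }
        }
        return copied;
    }
}
```

   With one spill the returned copy count is zero, and with several it equals 
the total size of all spill files, which is the read/write cost discussed above.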


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

