[GitHub] [spark] mridulm commented on pull request #37922: [SPARK-40480][SHUFFLE] Remove push-based shuffle data after query finished

GitBox Sun, 18 Sep 2022 22:38:22 -0700


mridulm commented on PR #37922:
URL: https://github.com/apache/spark/pull/37922#issuecomment-1250590696


   When making protocol changes, we should decouple them from current 
implementation details, and make the protocol extensible for future evolution.
   
   In this case though, it is much more straightforward - there is an existing 
use case which requires the shuffle merge id.
   When retrying an indeterminate stage, we should clean up merged shuffle data 
for the previous stage attempt (in `submitMissingTasks`, before 
`unregisterAllMapAndMergeOutput`) - and given the potential race conditions 
there, we don't want `RemoveShuffleMerge` to clean up for the next attempt (when 
we add support for this).
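
   To illustrate the race, here is a minimal sketch, not the actual shuffle 
service code. `MergedShuffleStore`, `registerMerge`, `remove` and 
`hasMergedData` are hypothetical names, and the sketch assumes the service 
tracks a single current shuffle merge id per shuffle. A removal request that 
carries the merge id of the previous attempt becomes a no-op if it arrives 
after the next attempt has already registered:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the merged-shuffle bookkeeping on a shuffle service. */
final class MergedShuffleStore {
    // shuffleId -> shuffleMergeId of the attempt whose merged data is currently held
    private final Map<Integer, Integer> currentMergeId = new HashMap<>();

    /** Records that merged data now exists for this attempt of the shuffle. */
    void registerMerge(int shuffleId, int shuffleMergeId) {
        currentMergeId.put(shuffleId, shuffleMergeId);
    }

    /**
     * Removes merged data only when the request targets the attempt currently held.
     * A stale request (older merge id) leaves the newer attempt's data untouched.
     * Returns true if anything was removed.
     */
    boolean remove(int shuffleId, int shuffleMergeId) {
        // Map.remove(key, value) removes the entry only if key is mapped to value.
        return currentMergeId.remove(shuffleId, shuffleMergeId);
    }

    boolean hasMergedData(int shuffleId) {
        return currentMergeId.containsKey(shuffleId);
    }
}
```

   Without the merge id in the request, the late removal for attempt 0 would 
be indistinguishable from a removal for attempt 1.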
   
   This specific change can be done in a follow-up PR though - I want to get 
the basic mechanics working in this PR, and ensure the cleanup use case is 
handled - before looking at further enhancements.
   



