agrawaldevesh commented on pull request #28708:
URL: https://github.com/apache/spark/pull/28708#issuecomment-643566981


   > Although I'm a little fuzzy on what you mean by "eager" (if you mean as 
soon as the migrations are completed then yes)
   
   Thank you for confirming! By *eager*, I specifically mean _somehow_ 
triggering, as soon as possible, the 
`DAGScheduler#handleExecutorLost(_, workerLost = true)` code path, so that it 
can clear out the shuffle map files. This is more about avoiding fetch 
failures caused by decommissioning than about recouping resources.
   
   One way this is triggered today is via 
`CoarseGrainedSchedulerBackend.DriverEndpoint#onDisconnected`, but I don't 
really know whether a timeout is involved there. Unfortunately, 
`workerLost = true` is set in only a few cases, so we might have to add some 
code (or do some testing) to achieve this.
   
   I think https://issues.apache.org/jira/browse/SPARK-31197 is meant for this?




