agrawaldevesh commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-643566981
> Although I'm a little fuzzy on what you mean by "eager" (if you mean as soon as the migrations are completed then yes)

Thank you for confirming! By *eager*, I specifically mean _somehow_ triggering, as soon as possible, a code path that reaches `DAGScheduler#handleExecutorLost(_, workerLost = true)`, so that it can clear out the shuffle map files. This is more about avoiding fetch failures caused by decommissioning than about recouping resources.

One way this is triggered today is via `CoarseGrainedSchedulerBackend.DriverEndpoint#onDisconnected`, but I don't know whether a timeout is at play there. Unfortunately, `workerLost = true` is set in only a few cases, so we might have to add some code (or do some testing) to achieve this. I think https://issues.apache.org/jira/browse/SPARK-31197 is meant for this?
