agrawaldevesh opened a new pull request #29014: URL: https://github.com/apache/spark/pull/29014
### What changes were proposed in this pull request?

This PR reduces the chance of job failure during decommissioning. It fixes two holes in the current decommissioning framework:

- (a) The loss of a decommissioned executor is no longer treated as a job failure: we know that the decommissioned executor will be dying soon, so its death is clearly not caused by the application.
- (b) Shuffle files on a decommissioned host are cleared as soon as the first fetch failure from that host is detected. The timing here is tricky: ideally the shuffle state would be cleared just before the shuffle service on the node dies (or the executor dies, when there is no external shuffle service). Clearing too soon wastes shuffle output that could still have been served; clearing too late leads to more fetch failures. The approach here is to clear the state upon the very first fetch failure observed from the decommissioned block manager, without waiting for other blocks to also signal failures.

### Why are the changes needed?

Without these changes, decommissioning many executors at once leads to job failures.

### Code overview

The `TaskSchedulerImpl` keeps track of all decommissioned executors in the set `executorsPendingDecommission`. This set is updated when the `TaskScheduler.executorDecommissioned` callback is called. The set is then consulted:

- (a) by the `TaskSchedulerImpl` when it gets a `SlaveLost` error, and
- (b) by the `DAGScheduler` when handling a fetch failure.

### Does this PR introduce _any_ user-facing change?

No. An internal config `FORCE_DECOMMISSIONING_OF_ENTIRE_NODE` is added as a short-term hack, and I am happy to remove it (see below).

(The changes for these two features share some common code, so I am packing both of them into the same commit.)

### How was this patch tested?

Added a new unit test suite, `DecommissionWorkerSuite`, which tests the new behavior by exercising Master-Worker decommissioning.
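To make the bookkeeping concrete, here is a minimal, self-contained sketch of the mechanism described in the code overview. It is a hypothetical simplification, not the actual Spark internals: the class name `DecomTracker` and its method names are illustrative, standing in for the logic that lives across `TaskSchedulerImpl` and `DAGScheduler`.

```scala
import scala.collection.mutable

// Illustrative stand-in for the bookkeeping in TaskSchedulerImpl/DAGScheduler.
class DecomTracker {
  // Executors that have been told to decommission but have not exited yet.
  private val executorsPendingDecommission = mutable.HashSet[String]()

  // Invoked via the TaskScheduler.executorDecommissioned callback.
  def executorDecommissioned(executorId: String): Unit =
    executorsPendingDecommission += executorId

  // Consulted on a SlaveLost error: the loss of a decommissioned executor
  // is expected, so it should not count toward job failure.
  def isExpectedLoss(executorId: String): Boolean =
    executorsPendingDecommission.contains(executorId)

  // Consulted on a fetch failure: if the serving block manager was
  // decommissioned, clear its shuffle state on the very first failure,
  // without waiting for other blocks to also fail.
  def shouldClearShuffleState(executorId: String): Boolean =
    executorsPendingDecommission.contains(executorId)

  // Once the executor is actually gone, stop tracking it.
  def executorRemoved(executorId: String): Unit =
    executorsPendingDecommission -= executorId
}
```

The key design point the sketch captures is that decommissioning is recorded ahead of time, so both failure paths (executor loss and fetch failure) can distinguish an expected death from a genuine application-caused one.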
I chose to add a new test suite since the setup logic is quite different from the existing `WorkerDecommissionSuite`. I am open to changing the name of the newly added suite :-)

### TODO/Discussion

- Should I add a feature flag to guard these two behaviors?
- Should this be two separate PRs, or two commits in the same PR?
- The `FORCE_DECOMMISSIONING_OF_ENTIRE_NODE` flag is a hack: ideally, the `DecommissionExecutor` message should tell us whether the shuffle service is being destroyed or not. The flag should be true when the Master is decommissioning an entire Worker with the external shuffle service (ESS) enabled, and likewise true when YARN/K8s is decommissioning just an executor without an ESS. Alternatively, we can drop `FORCE_DECOMMISSIONING_OF_ENTIRE_NODE` here until the plumbing of `DecommissionExecutor` is done -- we only need this flag when there are multiple executors on a node.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
