agrawaldevesh opened a new pull request #29014:
URL: https://github.com/apache/spark/pull/29014


   ### What changes were proposed in this pull request?
   
   This PR reduces the prospect of a job loss during decommissioning. It
   fixes two holes in the current decommissioning framework:
   
   - (a) Loss of decommissioned executors is not treated as a job failure:
   we know that a decommissioned executor will die soon, so its death is
   clearly not caused by the application.
   
   - (b) Shuffle files on the decommissioned host are cleared when the
   first fetch failure is detected from a decommissioned host: the tricky
   part is deciding when to clear the shuffle state. Ideally you want to
   clear it just before the shuffle service on the node dies (or before the
   executor dies when there is no external shuffle service) -- clearing too
   soon could lead to some wasted recomputation, while clearing too late
   leads to fetch failures.
   
   The approach taken here is to clear the shuffle state as soon as the very
   first fetch failure is observed from the decommissioned block manager,
   without waiting for other blocks to also signal failures (see the sketch
   below).
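
   A minimal, self-contained sketch of that decision, assuming a hypothetical
   `FetchFailureHandler` helper and a stand-in `unregisterAllShuffleOutputOnHost`
   method (the real change hooks into the `DAGScheduler` and the map output
   tracker instead):

```scala
import scala.collection.mutable

// Hypothetical helper illustrating "clear everything on the first fetch
// failure from a decommissioned host"; names are illustrative only.
class FetchFailureHandler(decommissionedHosts: () => Set[String]) {
  private val alreadyCleared = mutable.Set.empty[String]

  def onFetchFailure(host: String): Unit = {
    // Only the first fetch failure from a decommissioned host triggers the
    // wholesale clearing; later failures from the same host are no-ops.
    if (decommissionedHosts().contains(host) && alreadyCleared.add(host)) {
      unregisterAllShuffleOutputOnHost(host)
    }
  }

  private def unregisterAllShuffleOutputOnHost(host: String): Unit = {
    // The real code would unregister the host's map outputs so the affected
    // stages are recomputed instead of repeatedly failing fetches.
    println(s"Clearing all shuffle output registered on $host")
  }
}
```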
   
   ### Why are the changes needed?
   
   Without them, decommissioning many executors at once leads to job failures.
   
   ### Code overview
   
   The `TaskSchedulerImpl` keeps track of all the decommissioned executors in
   the set `executorsPendingDecommission`. This set is updated when the
   `TaskScheduler.executorDecommissioned` callback is called. The set is then
   consulted (a) by the `TaskSchedulerImpl` when it gets a `SlaveLost` error,
   and (b) by the `DAGScheduler` when handling a fetch failure. A sketch of
   this bookkeeping follows.
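
   A simplified, self-contained sketch of that bookkeeping, assuming a
   standalone class with hypothetical method names (the real fields and call
   sites live in `TaskSchedulerImpl` and `DAGScheduler`):

```scala
import scala.collection.mutable

// Simplified stand-in for the real TaskSchedulerImpl/DAGScheduler state.
class DecommissionBookkeepingSketch {
  // Updated from the TaskScheduler.executorDecommissioned callback.
  private val executorsPendingDecommission = mutable.Set.empty[String]

  def executorDecommissioned(executorId: String): Unit =
    executorsPendingDecommission += executorId

  // (a) Consulted on SlaveLost: losing a decommissioned executor is expected
  //     and should not count as an application-caused failure.
  def isLossExpected(executorId: String): Boolean =
    executorsPendingDecommission.contains(executorId)

  // (b) Consulted on a fetch failure: a failure from a decommissioned
  //     executor means its shuffle files are gone (or about to be).
  def isFetchFailureFromDecommissionedExecutor(executorId: String): Boolean =
    executorsPendingDecommission.contains(executorId)
}
```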
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. There is an internal config `FORCE_DECOMMISSIONING_OF_ENTIRE_NODE` added
   as a short-term hack, and I am happy to remove it (see below).
   
   (The changes for these two features share some common code, and hence I am
   packing both of them into the same commit.)
   
   ### How was this patch tested?
   
   Added a new unit test suite, `DecommissionWorkerSuite`, to test the new
   behavior by exercising Master-Worker decommissioning. I chose to add a new
   suite since the setup logic is quite different from the existing
   `WorkerDecommissionSuite`. I am open to changing the name of the newly added
   test suite :-)
   
   ### TODO/Discussion
   - Should I add a feature flag to guard these two behaviors?
   - Should this be two separate PRs or two commits in the same PR?
   - The FORCE_DECOMMISSIONING_OF_ENTIRE_NODE flag is a hack: ideally the
   DecommissionExecutor message should tell us whether the shuffle service
   is being destroyed or not. This flag should be true when the Master is
   decommissioning an entire Worker with the external shuffle service (ESS)
   enabled, and similarly true when YARN/K8s is decommissioning just the
   executor without an ESS. Alternatively, we can drop
   FORCE_DECOMMISSIONING_OF_ENTIRE_NODE here until the plumbing of
   DecommissionExecutor is done -- we only need this flag when there are
   multiple executors on a node. (See the config sketch after this list.)
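
   For reference, a hedged sketch of how such an internal config could be
   declared with Spark's `ConfigBuilder`; the key name, doc text, and default
   below are assumptions, not the actual definition added by this PR:

```scala
package org.apache.spark.internal.config

// Hypothetical declaration sketch; the key name, doc text, and default are
// assumptions, not the actual definition added by this PR.
private[spark] object DecommissionSketch {
  val FORCE_DECOMMISSIONING_OF_ENTIRE_NODE =
    ConfigBuilder("spark.decommission.forceDecommissioningOfEntireNode")
      .internal()
      .doc("Whether decommissioning an executor also implies that the shuffle " +
        "state on its node (e.g. the external shuffle service) will be lost.")
      .booleanConf
      .createWithDefault(false)
}
```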
   

