prakharjain09 opened a new pull request #28370:
URL: https://github.com/apache/spark/pull/28370


   What changes were proposed in this pull request?
   After the changes in SPARK-20628, CoarseGrainedSchedulerBackend can decommission
an executor and stop assigning new tasks to it. We should also decommission the
corresponding block managers in the same way, i.e., move the cached RDD blocks
from those executors to other active executors.
   
   Why are the changes needed?
   We need to gracefully decommission the block managers so that the underlying
RDD cache blocks are not lost if the executors are taken away forcefully after
some timeout (because of spot loss, pre-emptible VMs, etc.). It is good to save
as much cached data as possible.
   
   Also, in the future, once the decommissioning signal comes from the cluster
manager (say YARN, Mesos, etc.), dynamic allocation plus this change gives us
the opportunity to downscale executors faster by making them free of cache data.
   
   Note that this is a best-effort approach. We try to move cache blocks from
decommissioning executors to active executors. If the active executors don't
have free resources available for caching, then a decommissioning executor will
keep the cache blocks it was not able to move and will still be able to serve
them.
   
   Current overall flow:
   
   CoarseGrainedSchedulerBackend receives a signal to decommission an executor.
On receiving the signal, it does two things: it stops assigning new tasks to
that executor (SPARK-20628), and it sends a message to BlockManagerMasterEndpoint
(via BlockManagerMaster) to decommission the corresponding BlockManager.
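
   Below is a minimal, standalone sketch of these two actions. The names
(`SchedulerBackendSketch`, `DecommissionBlockManagers`, `canScheduleOn`) and the
callback wiring are illustrative assumptions based on this description, not the
exact Spark signatures.

```scala
import scala.collection.mutable

// Hypothetical message type mirroring the PR description.
case class DecommissionBlockManagers(executorIds: Seq[String])

class SchedulerBackendSketch(
    sendToBlockManagerMaster: DecommissionBlockManagers => Unit) {

  private val decommissioning = mutable.Set[String]()

  def decommissionExecutor(executorId: String): Unit = {
    // 1. Stop assigning new tasks to this executor (SPARK-20628).
    decommissioning += executorId
    // 2. Ask BlockManagerMasterEndpoint (via BlockManagerMaster) to
    //    decommission the corresponding block manager.
    sendToBlockManagerMaster(DecommissionBlockManagers(Seq(executorId)))
  }

  // The task scheduler would consult this before offering tasks to an executor.
  def canScheduleOn(executorId: String): Boolean =
    !decommissioning.contains(executorId)
}
```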
   
   BlockManagerMasterEndpoint receives the "DecommissionBlockManagers" message.
On receiving it, it moves the corresponding block managers to a
"decommissioning" state. All decommissioning BMs are excluded from the getPeers
RPC call, which is used for replication. BlockManagerMasterEndpoint also sends
each decommissioning BM a message to start the decommissioning process on
itself.
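
   A simplified sketch of that bookkeeping, assuming block managers are tracked
by plain string ids (the real endpoint tracks richer BlockManagerId objects and
sends RPCs to each decommissioned BM):

```scala
import scala.collection.mutable

class BlockManagerMasterEndpointSketch {
  private val liveBlockManagers = mutable.Set[String]()   // registered BM ids
  private val decommissioningBMs = mutable.Set[String]()

  def registerBlockManager(bmId: String): Unit = liveBlockManagers += bmId

  def decommissionBlockManagers(bmIds: Seq[String]): Unit = {
    // Move the given block managers into the "decommissioning" state.
    decommissioningBMs ++= bmIds
    // The real endpoint would also send each of them a
    // "DecommissionBlockManager" message here so they start offloading blocks.
  }

  // Peers offered for replication exclude the requester itself and
  // every block manager that is currently decommissioning.
  def getPeers(requester: String): Seq[String] =
    liveBlockManagers.diff(decommissioningBMs).filterNot(_ == requester).toSeq
}
```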
   
   A BlockManager on a worker (say BM-x) receives the "DecommissionBlockManager"
message. It then starts a BlockManagerDecommissionManager thread to offload all
the cached RDD blocks. This thread can make multiple attempts to decommission
the existing cache blocks (multiple attempts might be needed because there might
not be sufficient space in the other active BMs initially).
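
   A standalone sketch of that retry loop on the worker side; `cachedBlocks`,
`tryReplicate`, and the attempt/sleep parameters are illustrative stand-ins, not
the PR's actual fields or values.

```scala
class BlockManagerDecommissionManagerSketch(
    cachedBlocks: () => Seq[String],   // RDD blocks still held locally
    tryReplicate: String => Boolean,   // true if a peer accepted the block
    maxAttempts: Int = 5,
    sleepMillis: Long = 1000L) {

  private val thread = new Thread(() => {
    var attempt = 0
    // Retry because other active BMs may not have free space at first.
    while (attempt < maxAttempts && cachedBlocks().nonEmpty) {
      cachedBlocks().foreach(tryReplicate) // best effort: failed blocks stay local
      attempt += 1
      Thread.sleep(sleepMillis)
    }
  }, "block-manager-decommission-thread")
  thread.setDaemon(true)

  def start(): Unit = thread.start()
}
```

   Blocks that could not be replicated after all attempts simply remain on the
decommissioning BM, which keeps serving them, matching the best-effort behavior
described above.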
   
   Does this PR introduce any user-facing change?
   No.
   
   How was this patch tested?
   Added unit tests.
   
   

