prakharjain09 edited a comment on issue #27864: [SPARK-20732][CORE] 
Decommission cache blocks to other executors when an executor is decommissioned
URL: https://github.com/apache/spark/pull/27864#issuecomment-605599508
 
 
   > Thank you so much for working on this, I'm really glad you picked it up. I have some questions about the design and some small places for improvement, but really excited.
   
   @holdenk Thanks for the review. My bad, I should have included the overall design description from the start.
   
   Current overall design:
   
   1) CoarseGrainedSchedulerBackend receives a decommissionExecutor signal. On receiving the signal, it does two things: it stops assigning new tasks to that executor (SPARK-20628), and it sends another message to BlockManagerMasterEndpoint (via BlockManagerMaster) to decommission the corresponding BlockManager.
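   The scheduler-side handling in step 1 can be sketched roughly as below. This is a minimal, simplified model, not the actual PR code: `SchedulerBackendSketch`, the plain-`String` executor IDs, and the function-typed `blockManagerMaster` parameter are stand-ins for the real Spark classes and RPC plumbing.

   ```scala
   // Hypothetical message, standing in for the real DecommissionBlockManagers RPC.
   case class DecommissionBlockManagers(executorIds: Seq[String])

   // Simplified stand-in for CoarseGrainedSchedulerBackend.
   class SchedulerBackendSketch(blockManagerMaster: DecommissionBlockManagers => Unit) {
     private val decommissioning = scala.collection.mutable.Set[String]()

     def decommissionExecutor(executorId: String): Unit = {
       // 1) Stop assigning new tasks to this executor (SPARK-20628).
       decommissioning += executorId
       // 2) Ask the BlockManagerMasterEndpoint (via BlockManagerMaster)
       //    to decommission the corresponding BlockManager.
       blockManagerMaster(DecommissionBlockManagers(Seq(executorId)))
     }

     // The task scheduler consults this before offering resources.
     def canScheduleOn(executorId: String): Boolean =
       !decommissioning.contains(executorId)
   }
   ```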
   
   2) BlockManagerMasterEndpoint receives the "DecommissionBlockManagers" message. On receiving this,
    - it updates the "decommissioningBlockManagerSet". This set contains all BMs which are undergoing decommissioning, and it is maintained so that the correct peer list can be given to the active BMs. Active BMs keep asking BlockManagerMasterEndpoint for their peer list, and only active BMs should be returned as part of it.
   
    - it sends a message to the corresponding BlockManager to start the 
decommissioning process.
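   The peer-list filtering in step 2 amounts to something like the sketch below. It is a simplified model under assumed names: `PeerTracking` and plain-`String` BM identifiers stand in for the real BlockManagerMasterEndpoint state and BlockManagerId.

   ```scala
   // Simplified stand-in for the peer bookkeeping in BlockManagerMasterEndpoint.
   class PeerTracking {
     private val aliveBlockManagers = scala.collection.mutable.Set[String]()
     private val decommissioningBlockManagerSet = scala.collection.mutable.Set[String]()

     def register(bm: String): Unit = aliveBlockManagers += bm
     def decommission(bms: Seq[String]): Unit = decommissioningBlockManagerSet ++= bms

     // Active BMs periodically ask for their peer list; only BMs that are
     // alive, are not the requester, and are not decommissioning qualify.
     def getPeers(requester: String): Set[String] =
       (aliveBlockManagers - requester -- decommissioningBlockManagerSet).toSet
   }
   ```

   A decommissioning BM therefore stops appearing in anyone's peer list, so no active BM picks it as a replication target while it drains.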
   
   
   3) BlockManager on a worker (say BM-x) receives the "DecommissionBlockManager" message. It takes the following two actions on receiving the message:
   - BM-x stops accepting new RDD cache blocks. Since this block manager is in the decommissioning state, it shouldn't store any new cache data (similar to DataNode decommissioning in HDFS); if it accepted new blocks, those blocks would also need to be offloaded.
   - BM-x starts the BlockManagerDecommissionManager thread. This thread tries to offload any cache blocks on this BM every 30 seconds (or a configured interval). On the first attempt, some cache blocks might not be offloaded because of limited space on the other BMs, so it retries after 30 seconds, and so on. The available cache space in the application can keep changing, because of dynamic allocation or because the user explicitly uncaches some other RDD, so multiple attempts may be needed to offload the RDD cache blocks from BM-x. Steps performed in a single attempt:
   a) Ask BlockManagerMasterEndpoint for the replication info of all the cache blocks on BM-x. This lets BM-x know where each block is already replicated, so it can avoid replicating to those places.
   b) For each block, try to replicate it to peers. Drop the block if it is successfully replicated to one of the peers; otherwise keep it. If the block is replicated and dropped from BM-x, this is automatically communicated to the driver, and dynamic allocation (the ExecutorMonitor class) updates its bookkeeping and removes the executor from the system.
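   The retry loop in step 3 can be sketched as below. This is a simplified, single-threaded model of the idea, not the PR's BlockManagerDecommissionManager: the real version runs in a background thread and sleeps between attempts, and `offloadUntilEmpty`, `replicate`, and `maxAttempts` are hypothetical names introduced here.

   ```scala
   // One retry loop: keep attempting until all blocks are offloaded or we
   // give up. `replicate` returns true when a block was successfully copied
   // to some peer; on success the block is dropped locally.
   def offloadUntilEmpty(
       blocks: scala.collection.mutable.Set[String],
       replicate: String => Boolean,
       maxAttempts: Int): Int = {
     var attempts = 0
     while (blocks.nonEmpty && attempts < maxAttempts) {
       attempts += 1
       // Failures stay in the set for the next attempt, since peer capacity
       // can change between attempts (dynamic allocation, user uncaching).
       for (b <- blocks.toSeq if replicate(b)) blocks -= b
     }
     attempts
   }
   ```

   Retrying rather than failing on the first pass is the key design point: peer capacity is a moving target, so a block that cannot be placed now may be placeable 30 seconds later.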
   
   Please review and share your suggestions.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
