srinivasst opened a new pull request #3287:
URL: https://github.com/apache/hadoop/pull/3287


   When a node heartbeats to the RM, the StatusUpdateWhenHealthyTransition 
checks if rmNode.runningApplications is empty before deactivating the node 
(killing all containers on the node).
   
   The data structure rmNode.runningApplications is updated in the same 
transition but after the call to RMNodeImpl.deactivateNode.
   
   This can lead to a race condition when a node that had 0 containers is
gracefully decommissioned immediately after a container is launched on it.
   
   The heartbeat received from the node just after the container is launched
updates rmNode.runningApplications only after the emptiness check has passed
and RMNodeImpl.deactivateNode has been called, so the node is deactivated and
all containers scheduled on it are killed.
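
   The ordering problem can be illustrated with a minimal sketch. This is a
simplified model, not the actual RMNodeImpl code: the class, the Node type, and
the method names below are hypothetical, and the deactivated flag merely stands
in for the effect of RMNodeImpl.deactivateNode.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of the race; NOT the real RMNodeImpl. All names are hypothetical. */
public class HeartbeatOrderingSketch {

    /** Minimal stand-in for the RM's view of a node. */
    static final class Node {
        final Set<String> runningApplications = new HashSet<>();
        boolean decommissioning;
        boolean deactivated;
    }

    /** Mirrors the ordering described above: check first, update afterwards. */
    static void onHeartbeatCheckThenUpdate(Node node, List<String> appsInHeartbeat) {
        if (node.decommissioning && node.runningApplications.isEmpty()) {
            // Stands in for RMNodeImpl.deactivateNode (assumption of this sketch).
            node.deactivated = true;
        }
        // The update arrives too late to influence the check above.
        node.runningApplications.addAll(appsInHeartbeat);
    }

    /** Returns true if the node was deactivated even though the heartbeat reported an app. */
    static boolean simulate() {
        Node node = new Node();
        node.decommissioning = true; // graceful decommission was just requested
        onHeartbeatCheckThenUpdate(node, Arrays.asList("app_1"));
        return node.deactivated;
    }

    public static void main(String[] args) {
        System.out.println("deactivated=" + simulate()); // prints deactivated=true
    }
}
```

   In this model the node is deactivated although the same heartbeat reports a
running application, which is the window described above.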
   
   If the container was an AM container, a retry is attempted without counting 
towards application failure. But for cases where max attempts is set to 1, the 
application is never retried (YARN-5617) and hence fails.
   
   This PR asks the scheduler whether any application's AM is scheduled on the
node before deactivating it.
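
   A minimal sketch of that kind of guard, under stated assumptions: the
SchedulerView interface and its hasAmScheduledOn method are hypothetical names
invented for illustration and are not YARN APIs, and the actual change in the
PR may be structured differently.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified model of the guarded check; all types and methods are hypothetical. */
public class SchedulerGuardSketch {

    /** Hypothetical stand-in for a scheduler query; not a real YARN interface. */
    interface SchedulerView {
        boolean hasAmScheduledOn(String nodeId);
    }

    /** Minimal stand-in for the RM's view of a node. */
    static final class Node {
        final String id;
        final Set<String> runningApplications = new HashSet<>();
        boolean deactivated;

        Node(String id) {
            this.id = id;
        }
    }

    /** Deactivates only if both the node's own state and the scheduler say it is idle. */
    static boolean maybeDeactivate(Node node, SchedulerView scheduler) {
        boolean idle = node.runningApplications.isEmpty()
            && !scheduler.hasAmScheduledOn(node.id);
        if (idle) {
            node.deactivated = true;
        }
        return node.deactivated;
    }

    public static void main(String[] args) {
        Set<String> nodesWithAm = new HashSet<>();
        nodesWithAm.add("node-1");
        SchedulerView scheduler = nodesWithAm::contains;

        System.out.println(maybeDeactivate(new Node("node-1"), scheduler)); // prints false
        System.out.println(maybeDeactivate(new Node("node-2"), scheduler)); // prints true
    }
}
```

   Consulting the scheduler closes the window in which runningApplications has
not yet caught up with a freshly scheduled AM container.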


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


