srinivasst opened a new pull request #3287: URL: https://github.com/apache/hadoop/pull/3287
When a node heartbeats to the RM, the StatusUpdateWhenHealthyTransition checks whether rmNode.runningApplications is empty before deactivating the node (killing all containers on the node). The data structure rmNode.runningApplications is updated in the same transition, but only after the call to RMNodeImpl.deactivateNode. This can lead to a race condition when a node is gracefully decommissioned immediately after a container is launched on it while it had 0 containers. The heartbeat received from the node just after the container is launched updates rmNode.runningApplications only after RMNodeImpl.deactivateNode has been called, so the node is deactivated and all scheduled containers are killed. If the container was an AM container, a retry is attempted without counting towards application failure. But for cases where max attempts is set to 1, the application is never retried (YARN-5617) and hence fails. This PR checks with the scheduler whether any application's AM is scheduled on the node before deactivating it.
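The ordering problem described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the actual RMNodeImpl code or the patch in this PR: NodeModel, heartbeatCheckThenUpdate, heartbeatWithSchedulerCheck, and the amsScheduledOnNode parameter are hypothetical names introduced for illustration, and the "scheduler view" is reduced to a plain set of application IDs.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DecommissionRaceSketch {

    /** Hypothetical stand-in for the RM's view of a node (not the real RMNodeImpl). */
    static final class NodeModel {
        final Set<String> runningApplications = new HashSet<>();
        boolean decommissioning;
        boolean deactivated;
    }

    /**
     * Models the order described in the PR text: the emptiness check (and the
     * deactivation) happens before runningApplications is refreshed from the
     * heartbeat, so a just-launched container is not seen yet.
     */
    static void heartbeatCheckThenUpdate(NodeModel node, List<String> reportedApps) {
        if (node.decommissioning && node.runningApplications.isEmpty()) {
            node.deactivated = true; // decision made on a stale view
        }
        node.runningApplications.addAll(reportedApps); // update arrives too late
    }

    /**
     * Models the idea of the fix: additionally ask a scheduler-side view
     * (assumed here to be a set of application IDs whose AM is scheduled on
     * this node) before deactivating.
     */
    static void heartbeatWithSchedulerCheck(NodeModel node,
                                            List<String> reportedApps,
                                            Set<String> amsScheduledOnNode) {
        boolean idle = node.runningApplications.isEmpty()
                && amsScheduledOnNode.isEmpty();
        if (node.decommissioning && idle) {
            node.deactivated = true;
        }
        node.runningApplications.addAll(reportedApps);
    }

    public static void main(String[] args) {
        List<String> reported = Collections.singletonList("application_1");

        NodeModel racy = new NodeModel();
        racy.decommissioning = true;
        heartbeatCheckThenUpdate(racy, reported);
        System.out.println("check-then-update deactivated: " + racy.deactivated);

        NodeModel guarded = new NodeModel();
        guarded.decommissioning = true;
        heartbeatWithSchedulerCheck(guarded, reported,
                Collections.singleton("application_1"));
        System.out.println("with scheduler check deactivated: " + guarded.deactivated);
    }
}
```

In this model, the first variant deactivates the node even though the heartbeat reports a running application, while the second leaves it active because the scheduler-side view still lists an AM on the node.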
