Shimin Yang created FLINK-9567:
----------------------------------

             Summary: Flink does not release resource in Yarn Cluster mode
                 Key: FLINK-9567
                 URL: https://issues.apache.org/jira/browse/FLINK-9567
             Project: Flink
          Issue Type: Bug
          Components: Cluster Management, YARN
    Affects Versions: 1.5.0
            Reporter: Shimin Yang


After restart the Job Manager in Yarn Cluster mode, Flink does not release task 
manager containers in some specific case. According to my observation, the 
reason is the instance variable *numPendingContainerRequests* in 
*YarnResourceManager* class does not decrease since it has not received the 
containers. And after restart of job manager, it will make increase the 
*numPendingContainerRequests* by the number of task executors. 

Since the callback function *onContainersAllocated* will return the excessive 
container immediately only if the *numPendingContainerRequests* <= 0, so the 
number of container grows bigger and bigger while only a few are acting as task 
manager.

I think it is important to clear the *numPendingContainerRequests* variable 
after restart the Job Manager, but not very clear at how to do that. There's no 
other way to decrease the *numPendingContainerRequests* except the 
*onContainersAllocated*. Is it fine to add a method to operate on the 
*numPendingContainerRequests* variable? And meanwhile, there's no handle of 
YarnResourceManager in the *ExecutionGraph* restart logic.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to