Shimin Yang created FLINK-9567:
----------------------------------
Summary: Flink does not release resource in Yarn Cluster mode
Key: FLINK-9567
URL: https://issues.apache.org/jira/browse/FLINK-9567
Project: Flink
Issue Type: Bug
Components: Cluster Management, YARN
Affects Versions: 1.5.0
Reporter: Shimin Yang
After the Job Manager restarts in Yarn Cluster mode, Flink does not release task
manager containers in some specific cases. According to my observation, the
reason is that the instance variable *numPendingContainerRequests* in the
*YarnResourceManager* class is not decreased, because the requested containers
have not been received yet. After the Job Manager restarts,
*numPendingContainerRequests* is increased again by the number of task executors.
Since the callback function *onContainersAllocated* returns excess containers
immediately only if *numPendingContainerRequests* <= 0, the number of
containers keeps growing while only a few of them act as task managers.
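To illustrate the counting problem described above, here is a minimal toy model
in Java. It is not the real *YarnResourceManager*: the class name, the `held`
and `returned` counters, and the `requestContainers` method are assumptions made
for illustration. Only the counter name and the callback name come from the
report.

```java
/** Toy model of the pending-request counter; NOT the real YarnResourceManager. */
public class PendingRequestLeakSketch {
    // Mirrors the counter named in the report.
    int numPendingContainerRequests = 0;
    // Hypothetical bookkeeping: containers kept vs. handed back to YARN.
    int held = 0;
    int returned = 0;

    /** Hypothetical stand-in for requesting n new task manager containers. */
    void requestContainers(int n) {
        numPendingContainerRequests += n;
    }

    /** Keeps a container while requests are pending, otherwise returns it. */
    void onContainersAllocated(int allocated) {
        for (int i = 0; i < allocated; i++) {
            if (numPendingContainerRequests > 0) {
                numPendingContainerRequests--;
                held++;
            } else {
                returned++; // excess container is released immediately
            }
        }
    }

    public static void main(String[] args) {
        PendingRequestLeakSketch rm = new PendingRequestLeakSketch();
        rm.requestContainers(2);     // job needs 2 task executors
        rm.requestContainers(2);     // restart before allocation: requested again
        rm.onContainersAllocated(4); // YARN eventually serves all 4 requests
        // Prints "held=4 returned=0" although only 2 containers are needed.
        System.out.println("held=" + rm.held + " returned=" + rm.returned);
    }
}
```

In this model every further restart that happens before allocation adds another
batch of pending requests, so the number of held containers keeps growing.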
I think it is important to clear the *numPendingContainerRequests* variable
when the Job Manager restarts, but it is not clear to me how to do that. There
is no way to decrease *numPendingContainerRequests* other than through
*onContainersAllocated*. Is it fine to add a method that operates on the
*numPendingContainerRequests* variable? Also, the *ExecutionGraph* restart
logic has no handle to the YarnResourceManager.
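A rough sketch of what such a method could look like, again on a toy model
rather than the real class. The method name `clearPendingContainerRequests` and
everything else except the counter and callback names are hypothetical. The
sketch also does not model withdrawing the outstanding requests from YARN
itself, which a real fix would presumably have to handle as well.

```java
/** Toy model of the proposed reset; names are hypothetical, not Flink API. */
public class PendingRequestResetSketch {
    int numPendingContainerRequests = 0;
    int held = 0;
    int returned = 0;

    /** Hypothetical stand-in for requesting n new task manager containers. */
    void requestContainers(int n) {
        numPendingContainerRequests += n;
    }

    /** Hypothetical hook: forget stale pending requests on restart. */
    void clearPendingContainerRequests() {
        numPendingContainerRequests = 0;
    }

    /** Keeps a container while requests are pending, otherwise returns it. */
    void onContainersAllocated(int allocated) {
        for (int i = 0; i < allocated; i++) {
            if (numPendingContainerRequests > 0) {
                numPendingContainerRequests--;
                held++;
            } else {
                returned++;
            }
        }
    }

    public static void main(String[] args) {
        PendingRequestResetSketch rm = new PendingRequestResetSketch();
        rm.requestContainers(2);            // initial request
        rm.clearPendingContainerRequests(); // restart: reset before re-requesting
        rm.requestContainers(2);
        rm.onContainersAllocated(4);        // YARN still delivers 4 containers
        // Prints "held=2 returned=2": the two excess containers are given back.
        System.out.println("held=" + rm.held + " returned=" + rm.returned);
    }
}
```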