[
https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shimin Yang updated FLINK-9567:
-------------------------------
Attachment: fulllog.txt
> Flink does not release resource in Yarn Cluster mode
> ----------------------------------------------------
>
> Key: FLINK-9567
> URL: https://issues.apache.org/jira/browse/FLINK-9567
> Project: Flink
> Issue Type: Bug
> Components: Cluster Management, YARN
> Affects Versions: 1.5.0
> Reporter: Shimin Yang
> Priority: Major
> Attachments: FlinkYarnProblem, fulllog.txt
>
>
> After restarting the Job Manager in YARN cluster mode, Flink does not
> release Task Manager containers in some specific cases.
> In the first log I posted, the container with id 24 is the reason why YARN
> did not release resources, even though the Task Manager in the container
> with id 24 was released before the restart.
> However, line 347 of the log shows:
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor -
> Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has
> failed, address is now gated for [50] ms. Reason: [Disassociated]
> This problem caused Flink to request one more container than needed.
> Since the return of excess containers is determined by the
> *numPendingContainerRequests* variable in *YarnResourceManager*, I think
> *onContainersCompleted* in *YarnResourceManager* called the method
> *requestYarnContainer*, which increased *numPendingContainerRequests*.
> However, the restart logic had already allocated enough containers for the
> Task Managers, so Flink holds on to the extra container for a long time for
> nothing. In the worst case, I had a job configured with 5 Task Managers
> that ended up holding more than 100 containers.
> PS: Another strange thing I found is that a request for a YARN container
> sometimes returns many more containers than requested. Is this normal
> behavior for AMRMClientAsync?
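The counting problem suspected above can be illustrated with a small toy model. This is only a sketch under assumptions: the class PendingRequestModel, its accessors, the container ids 25 to 30, and the guarded variant are hypothetical and are not Flink's actual code; only the names numPendingContainerRequests, requestYarnContainer, and onContainersCompleted are taken from the report.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the suspected bookkeeping; NOT Flink's YarnResourceManager.
class PendingRequestModel {
    private int numPendingContainerRequests = 0;
    private final Set<Integer> heldContainers = new HashSet<>();
    private int returnedExcess = 0;

    // Each request raises the number of containers we are willing to keep.
    void requestYarnContainer() {
        numPendingContainerRequests++;
    }

    // Keep a container only while requests are pending; otherwise return it.
    void onContainersAllocated(int... containerIds) {
        for (int id : containerIds) {
            if (numPendingContainerRequests > 0) {
                numPendingContainerRequests--;
                heldContainers.add(id);
            } else {
                returnedExcess++;
            }
        }
    }

    // Unguarded: asks for a replacement even for a container we no longer hold.
    void onContainersCompleted(int containerId) {
        heldContainers.remove(containerId);
        requestYarnContainer();
    }

    // Hypothetical guard: only replace containers that were still held.
    void onContainersCompletedGuarded(int containerId) {
        if (heldContainers.remove(containerId)) {
            requestYarnContainer();
        }
    }

    int pending() { return numPendingContainerRequests; }
    int held() { return heldContainers.size(); }
    int returnedExcess() { return returnedExcess; }

    public static void main(String[] args) {
        PendingRequestModel m = new PendingRequestModel();
        // Restart logic requests and receives the 5 containers the job needs.
        for (int i = 0; i < 5; i++) {
            m.requestYarnContainer();
        }
        m.onContainersAllocated(25, 26, 27, 28, 29);
        // A late completion notice arrives for container 24, released earlier.
        m.onContainersCompleted(24);
        // The next allocated container is kept instead of being returned.
        m.onContainersAllocated(30);
        System.out.println("held=" + m.held() + " pending=" + m.pending());
    }
}
```

In this model the job ends up holding one container more than it needs, because the late completion notice raises the pending count after the restart logic has already been satisfied. With the guarded variant, the same notice leaves the count unchanged and the extra container is returned as excess.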
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)