[
https://issues.apache.org/jira/browse/FLINK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann updated FLINK-12342:
----------------------------------
Fix Version/s: (was: 1.8.1)
               (was: 1.7.3)
               (was: 1.9.0)
               1.9.2
               1.8.3
               1.10.0
> Yarn Resource Manager Acquires Too Many Containers
> --------------------------------------------------
>
> Key: FLINK-12342
> URL: https://issues.apache.org/jira/browse/FLINK-12342
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.6.4, 1.7.2, 1.8.0
> Environment: We run jobs in Flink release 1.6.3.
> Reporter: Zhenqiu Huang
> Assignee: Till Rohrmann
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.10.0, 1.8.3, 1.9.2
>
> Attachments: Screen Shot 2019-04-29 at 12.06.23 AM.png,
> container.log, flink-1.4.png, flink-1.6.png
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> In the current implementation of YarnFlinkResourceManager, containers are acquired
> one by one as requests arrive from the SlotManager. This mechanism works when the
> job is small, say fewer than 32 containers. If the job needs 256 containers, they
> cannot all be allocated immediately, and the pending requests accumulated in the
> AMRMClient are not removed accordingly. We observed that the AMRMClient is asked
> for the current number of pending requests + 1 (the new request from the
> SlotManager) containers. As a result, during the start-up of such a job, 4000+
> containers were requested. If an external dependency issue occurs, for example
> slow HDFS access, the whole job is blocked without getting enough resources and is
> finally killed with a SlotManager request timeout. Thus, we should use the total
> number of containers asked for, rather than the pending requests in the AMRMClient,
> as the threshold for deciding whether to add one more resource request.
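> A minimal sketch of the proposed threshold check follows; the class, field, and
> method names are illustrative assumptions, not the actual YarnFlinkResourceManager
> code:
> {code:java}
> // Illustrative sketch only: decide whether one more container must be requested
> // from YARN. The threshold is the total number of containers asked for so far
> // (allocated + still pending), not the pending count alone.
> public class ContainerRequestGate {
>     private int allocatedContainers;       // containers already granted by YARN
>     private int pendingContainerRequests;  // requests sent but not yet fulfilled
>
>     /**
>      * Called for each new slot request from the SlotManager. A new container is
>      * requested only if the total asked for so far (allocated + pending) does not
>      * already cover the demand, so outstanding requests are not duplicated while
>      * YARN is slow to respond.
>      */
>     public boolean shouldRequestContainer(int requiredContainers) {
>         int totalRequested = allocatedContainers + pendingContainerRequests;
>         return totalRequested < requiredContainers;
>     }
>
>     public void onContainerRequested() {
>         pendingContainerRequests++;
>     }
>
>     public void onContainerAllocated() {
>         pendingContainerRequests--;
>         allocatedContainers++;
>     }
> }
> {code}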
--
This message was sent by Atlassian Jira
(v8.3.4#803005)