Zhenqiu Huang created FLINK-10868:
-------------------------------------
Summary: Flink's Yarn ResourceManager doesn't use
yarn.maximum-failed-containers as limit of resource acquirement
Key: FLINK-10868
URL: https://issues.apache.org/jira/browse/FLINK-10868
Project: Flink
Issue Type: Bug
Components: YARN
Affects Versions: 1.6.2, 1.7.0
Reporter: Zhenqiu Huang
Assignee: Zhenqiu Huang
Currently, YarnResourceManager does not use yarn.maximum-failed-containers as
a limit on resource acquisition. In the worst case, when newly started
containers consistently fail, YarnResourceManager goes into an infinite
resource acquisition loop without failing the job. Together with
https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all
resources of the YARN queue.
In production, we observed the following: a task manager failed in an
HA-enabled Flink job. At the same time, there was an HDFS failover. During that
period, requests failed with "Operation category READ is not supported in state
standby". Thus, newly acquired task managers kept failing.
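
The expected behavior could be sketched roughly as follows. This is only an
illustration, not the actual YarnResourceManager code: the class
FailedContainerLimiter and its methods are hypothetical names, and treating a
negative limit as "unlimited" is an assumption.

```java
/**
 * Hypothetical sketch (not Flink's YarnResourceManager): counts failed
 * containers and reports when the configured maximum has been exceeded,
 * so the caller can fail the job instead of requesting containers forever.
 */
final class FailedContainerLimiter {

    // Value read from yarn.maximum-failed-containers.
    // Assumption: a negative value means "no limit".
    private final int maxFailedContainers;

    // Number of container failures observed so far.
    private int failedContainers;

    FailedContainerLimiter(int maxFailedContainers) {
        this.maxFailedContainers = maxFailedContainers;
    }

    /**
     * Records one container failure.
     *
     * @return true if a replacement container may still be requested,
     *         false if the limit is exceeded and the job should fail
     */
    boolean onContainerFailed() {
        failedContainers++;
        return maxFailedContainers < 0 || failedContainers <= maxFailedContainers;
    }

    int getFailedContainers() {
        return failedContainers;
    }
}
```

With a limit of 2, the first two calls to onContainerFailed() return true and
the third returns false. At that point the resource manager would fail the job
rather than request another container.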