Zhenqiu Huang created FLINK-10868:
-------------------------------------

             Summary: Flink's Yarn ResourceManager doesn't use 
yarn.maximum-failed-containers as limit of resource acquirement
                 Key: FLINK-10868
                 URL: https://issues.apache.org/jira/browse/FLINK-10868
             Project: Flink
          Issue Type: Bug
          Components: YARN
    Affects Versions: 1.6.2, 1.7.0
            Reporter: Zhenqiu Huang
            Assignee: Zhenqiu Huang


Currently, YarnResourceManager does not use yarn.maximum-failed-containers as
a limit on resource acquisition. In the worst case, when newly started
containers consistently fail, YarnResourceManager goes into an infinite
resource acquisition loop without failing the job. Together with
https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all
resources of the YARN queue.
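
To illustrate the expected behavior, here is a minimal sketch of a failure
counter bounded by a configured limit. It is not the actual YarnResourceManager
code: the class FailedContainerLimiter and its methods are hypothetical names,
and the choice to stop retrying once the count exceeds the limit is an
assumption about how yarn.maximum-failed-containers should be interpreted.

```java
/**
 * Hypothetical helper (not part of Flink): counts failed containers and
 * reports whether a replacement container may still be requested.
 */
final class FailedContainerLimiter {

    // Assumed to be read from yarn.maximum-failed-containers by the caller.
    private final int maxFailedContainers;

    // Number of container failures recorded so far.
    private int failedContainers;

    FailedContainerLimiter(int maxFailedContainers) {
        if (maxFailedContainers < 0) {
            throw new IllegalArgumentException("limit must be non-negative");
        }
        this.maxFailedContainers = maxFailedContainers;
    }

    /**
     * Records one container failure.
     *
     * @return true if the failure count is still within the limit, so a new
     *     container may be requested; false if the job should be failed.
     */
    boolean recordFailureAndCheckRetry() {
        failedContainers++;
        return failedContainers <= maxFailedContainers;
    }

    int getFailedContainers() {
        return failedContainers;
    }
}
```

With a check like this in the container-completed path, the resource manager
would request a replacement only while recordFailureAndCheckRetry() returns
true, and fail the job otherwise instead of acquiring containers forever.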

In production, we observed the following: a task manager failed in an
HA-enabled Flink job. At the same time, there was an HDFS failover. During that
period, HDFS reads were rejected with "Operation category READ is not supported
in state standby". Thus, newly acquired task managers kept failing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
