[ https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693833#comment-16693833 ]
Shuyi Chen commented on FLINK-10868:
------------------------------------
Hi [~till.rohrmann], the following describes the problem that we saw:
1) In YarnResourceManager, after a container is allocated, it is started in onContainerAllocated().
2) createTaskExecutorLaunchContext calls fs.getFileStatus in registerLocalResource, which accesses the file status on HDFS.
3) In rare scenarios, some of the above files in HDFS are not accessible due to HDFS issues. createTaskExecutorLaunchContext then throws an exception, and because the files are no longer accessible, the repeated container start failures cause YarnResourceManager to keep re-acquiring resources.

In the above case, the job ends up in a loop of acquiring new resources. Since the files are already broken/missing, there is no way for Flink to recover by itself; we need to fail the job and fall back to the client side to fix the files and resubmit entirely. Together with [FLINK-10848|https://issues.apache.org/jira/browse/FLINK-10848], this exacerbates the problem and can deplete the entire YARN queue's resources. I've submitted a PR to fix [FLINK-10848|https://issues.apache.org/jira/browse/FLINK-10848]; could you please also help take a look?

I am wondering if we could separate this JIRA into two parts: one for per-job clusters and one for session clusters. For this JIRA, we could:
1) apply yarn.maximum-failed-containers in per-job cluster mode,
2) log a warning saying that yarn.maximum-failed-containers is not supported for session clusters, and
3) update the documentation on yarn.maximum-failed-containers on the website.
What do you think?
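For illustration, the per-job-cluster behavior proposed in 1) could be sketched as a simple failure counter checked on each container start failure. This is a minimal, self-contained sketch, not Flink's actual YarnResourceManager code; the class and method names (FailedContainerTracker, recordFailure) are hypothetical, and maxFailedContainers stands in for the yarn.maximum-failed-containers setting.

```java
// Hypothetical sketch: count container start failures and signal that the
// job should be failed once the configured maximum is exceeded, instead of
// re-acquiring containers forever.
public class FailedContainerTracker {

    // Stand-in for the yarn.maximum-failed-containers configuration value.
    private final int maxFailedContainers;
    private int failedContainers = 0;

    public FailedContainerTracker(int maxFailedContainers) {
        this.maxFailedContainers = maxFailedContainers;
    }

    /**
     * Record one container start failure.
     *
     * @return true if the failure count now exceeds the limit and the job
     *         should be failed rather than requesting a new container.
     */
    public boolean recordFailure() {
        failedContainers++;
        return failedContainers > maxFailedContainers;
    }

    public int getFailedContainers() {
        return failedContainers;
    }
}
```

When recordFailure() returns true, the resource manager would stop re-acquiring containers and fail the job, so the client can fix the broken/missing HDFS files and resubmit.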
> Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as
> limit of resource acquirement
> --------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10868
>                 URL: https://issues.apache.org/jira/browse/FLINK-10868
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>
> Currently, YarnResourceManager doesn't use yarn.maximum-failed-containers as a
> limit on resource acquirement. In the worst case, when newly started containers
> consistently fail, YarnResourceManager will go into an infinite resource
> acquirement loop without failing the job. Together with
> https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all
> resources of the YARN queue.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)