[
https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693833#comment-16693833
]
Shuyi Chen commented on FLINK-10868:
------------------------------------
Hi [~till.rohrmann], the following describes the the problem that we saw:
1) In YarnResourceManager, after a container is allocated, it is started in
onContainerAllocated().
2) createTaskExecutorLaunchContext calls fs.getFileStatus in
registerLocalResource, which accesses the file status on HDFS.
3) In the rare scenario where some of the above files on HDFS are not
accessible due to HDFS issues, createTaskExecutorLaunchContext throws an
exception, and YarnResourceManager keeps reacquiring resources because every
container start fails while the files remain inaccessible.
In the above case, the job enters an endless loop of acquiring new resources.
Since the files are already broken/missing, there is no way for Flink to
recover by itself; we need to fail the job and fall back to the client side to
fix the files and resubmit entirely.
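The reacquisition loop in steps 1-3 above can be sketched as follows. This is a minimal, self-contained simulation; the class and method names are hypothetical stand-ins for YarnResourceManager's internals, not the actual Flink or Hadoop APIs:

```java
// Simulation of the container reacquisition loop described above.
public class ContainerStartLoop {

    // Stand-in for registerLocalResource: throws the way fs.getFileStatus
    // does when a file referenced by the launch context is gone from HDFS.
    static void registerLocalResource(String path) {
        throw new RuntimeException("File does not exist: " + path);
    }

    // Simulate n container-start attempts; every one fails, and the RM
    // simply requests a replacement container each time.
    static int simulateFailedStarts(int n) {
        int failed = 0;
        for (int i = 0; i < n; i++) {
            try {
                registerLocalResource("hdfs:///flink/lib/job.jar");
            } catch (RuntimeException e) {
                failed++; // current behavior: reacquire, never fail the job
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // With no cap on failed containers, nothing ever stops this loop.
        System.out.println(simulateFailedStarts(5)
                + " failed starts, job still running");
    }
}
```

In the real code path the loop has no bound at all: each failed start simply triggers another container request.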
Together with [FLINK-10848|https://issues.apache.org/jira/browse/FLINK-10848],
it even exacerbates the problem and can deplete the entire YARN queue's
resources. I've submitted a PR to fix
[FLINK-10848|https://issues.apache.org/jira/browse/FLINK-10848]; could you
please also help take a look?
I am wondering if we could split this JIRA into two parts, one for
PerJobCluster and one for session cluster. For this JIRA, we could
1) apply yarn.maximum-failed-containers in PerJobCluster mode
2) log a warning saying that yarn.maximum-failed-containers is not supported
for session clusters
3) update the documentation for yarn.maximum-failed-containers on the website
What do you think?
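Part (1) could be as simple as a counter checked on each container-start failure. A minimal sketch, assuming a hypothetical tracker class (FailedContainerTracker and its method are illustrative, not Flink's actual API):

```java
// Hypothetical helper for per-job mode: count failed container starts and
// signal that the job should fail once yarn.maximum-failed-containers is hit.
public class FailedContainerTracker {
    private final int maxFailedContainers; // yarn.maximum-failed-containers
    private int failedContainers = 0;

    public FailedContainerTracker(int maxFailedContainers) {
        this.maxFailedContainers = maxFailedContainers;
    }

    /** Returns true if the job should be failed instead of reacquiring. */
    public boolean recordFailureAndCheckLimit() {
        failedContainers++;
        // A negative configured value keeps today's unlimited-retry behavior.
        return maxFailedContainers >= 0
                && failedContainers > maxFailedContainers;
    }

    public static void main(String[] args) {
        FailedContainerTracker tracker = new FailedContainerTracker(2);
        tracker.recordFailureAndCheckLimit(); // failure 1 -> keep reacquiring
        tracker.recordFailureAndCheckLimit(); // failure 2 -> keep reacquiring
        // failure 3 exceeds the limit -> fail the job
        System.out.println("fail job: " + tracker.recordFailureAndCheckLimit());
    }
}
```

In YarnResourceManager's container-failure path (per-job mode only), such a check would run before requesting a replacement container; when it returns true, the application is failed so the error surfaces to the client instead of looping.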
> Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as
> limit of resource acquirement
> --------------------------------------------------------------------------------------------------------
>
> Key: FLINK-10868
> URL: https://issues.apache.org/jira/browse/FLINK-10868
> Project: Flink
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.6.2, 1.7.0
> Reporter: Zhenqiu Huang
> Assignee: Zhenqiu Huang
> Priority: Major
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers as
> a limit on resource acquisition. In the worst case, when newly started
> containers consistently fail, YarnResourceManager goes into an infinite
> resource acquisition loop without failing the job. Together with
> https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all
> resources of the YARN queue.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)