[ 
https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693833#comment-16693833
 ] 

Shuyi Chen commented on FLINK-10868:
------------------------------------

Hi [~till.rohrmann], the following describes the problem that we saw:
1) In YarnResourceManager, after a container is allocated, it is started in 
onContainerAllocated().
2) createTaskExecutorLaunchContext calls fs.getFileStatus in 
registerLocalResource, which accesses the file status on HDFS.
3) In the rare scenario where some of the above files on HDFS are not 
accessible due to HDFS issues, createTaskExecutorLaunchContext throws an 
exception, and YarnResourceManager keeps reacquiring resources because the 
container start keeps failing (the files are no longer accessible).
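The loop described above can be illustrated with a rough, self-contained sketch (hypothetical names, not Flink's actual code): if the container-start step always fails and no failure limit is enforced, a new container is requested every time and the cycle never terminates.

```java
import java.util.function.Supplier;

// Hypothetical illustration of the re-acquisition loop; this is NOT
// Flink's YarnResourceManager code, just a model of its behavior.
public class AllocationLoopDemo {

    /**
     * Simulates container allocation attempts. Returns the number of
     * attempts made before either a successful start or hitting 'cap'
     * (the cap stands in for an external limit that Flink currently
     * does not enforce).
     */
    public static int attemptsUntilCap(Supplier<Boolean> startContainer, int cap) {
        int attempts = 0;
        while (attempts < cap) {
            attempts++;
            if (startContainer.get()) {
                return attempts; // container started successfully
            }
            // Start failed (e.g. launch context could not be built because
            // getFileStatus threw): a new container is requested and the
            // cycle repeats.
        }
        return attempts; // only the artificial cap stopped the loop
    }
}
```

With a start step that always fails (as when the HDFS files are gone), only the artificial cap ends the loop, which mirrors the infinite re-acquisition seen here.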

In this case, the job ends up in a loop of acquiring new resources. Since the 
files are already broken/missing, there is no way for Flink to recover by 
itself; we need to fail the job and fall back to the client side to fix the 
files and resubmit entirely.

Together with [FLINK-10848|https://issues.apache.org/jira/browse/FLINK-10848], 
this exacerbates the problem and can deplete the entire YARN queue's 
resources. I've submitted a PR to fix 
[FLINK-10848|https://issues.apache.org/jira/browse/FLINK-10848]; could you 
please also help take a look?

I am wondering if we could split this JIRA into two parts, one for 
PerJobCluster and one for session cluster. For this JIRA, we could:
1) apply yarn.maximum-failed-containers in PerJobCluster mode
2) log a warning saying that yarn.maximum-failed-containers is not supported 
for session clusters
3) update the documentation on yarn.maximum-failed-containers on the website
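The per-job check in (1) could look roughly like the following sketch. All names here (ContainerFailureGuard, recordFailure) are hypothetical and do not match Flink's actual classes; the point is only that a bounded failure counter turns the infinite loop into a fatal job failure.

```java
// Hypothetical sketch: cap repeated container-start failures using the
// configured yarn.maximum-failed-containers value. Not Flink's real API.
public class ContainerFailureGuard {
    private final int maxFailedContainers;
    private int failedContainers = 0;

    public ContainerFailureGuard(int maxFailedContainers) {
        this.maxFailedContainers = maxFailedContainers;
    }

    /**
     * Record one failed container start. Returns true once the configured
     * limit is exceeded, at which point the ResourceManager should fail
     * the job instead of requesting yet another container.
     */
    public boolean recordFailure() {
        failedContainers++;
        return failedContainers > maxFailedContainers;
    }
}
```

In per-job mode the guard would be consulted on every container-start failure; once it trips, the job fails and control returns to the client, which can fix the HDFS files and resubmit.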

What do you think?

> Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as 
> limit of resource acquirement
> --------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10868
>                 URL: https://issues.apache.org/jira/browse/FLINK-10868
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers 
> as a limit on resource acquisition. In the worst case, when newly started 
> containers consistently fail, YarnResourceManager goes into an infinite 
> resource-acquisition loop without failing the job. Together with 
> https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy 
> all resources of the YARN queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
