[jira] [Commented] (FLINK-10868) Flink's JobCluster ResourceManager doesn't use maximum-failed-containers as limit of resource acquirement

Zhenqiu Huang (Jira) Thu, 17 Dec 2020 22:35:07 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251539#comment-17251539
 ]


Zhenqiu Huang commented on FLINK-10868:
---------------------------------------

Hi [~xintongsong]

I just saw that you commented on the Jira ticket. Your summary of the problem 
and solution is accurate. Without a failure rate limit, the worst case we saw 
was a bad job that could not download its job jar from HDFS: the Flink 
resource manager kept requesting more containers from YARN and eventually 
blocked the whole queue. During that outage, it prevented another critical 
pipeline from upgrading its job and submitting it to the same queue. Thus, in 
the current implementation, I chose to cancel all pending requests and kill 
the job. 

I agree that it could be a generic solution for both YARN and Kubernetes. 
Besides leveraging FailureRater for cool-down time management, I would also 
suggest adding count metrics for container failures, so that the on-call 
engineer can handle the worst situation in time. What do you think? If we are 
on the same page, I would like to change the PR accordingly. Thanks.
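
To make the idea concrete, here is a minimal sketch of a sliding-window 
failure limit combined with a failure counter. It is only an illustration: 
the class ContainerFailureTracker, its methods, and its parameters are 
hypothetical names, not Flink's FailureRater or any existing Flink API, and 
the window semantics are an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch: tracks container failures in a sliding time window
 * and keeps a running total that could be exposed as a count metric.
 * This is NOT Flink's FailureRater; all names here are assumptions.
 */
class ContainerFailureTracker {
    private final int maxFailuresPerWindow;
    private final long windowMillis;
    // Timestamps of failures that may still fall inside the window.
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();
    // Total failures since start; never reset by the window.
    private long totalFailures;

    ContainerFailureTracker(int maxFailuresPerWindow, long windowMillis) {
        this.maxFailuresPerWindow = maxFailuresPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Records one container failure observed at the given time. */
    void recordFailure(long nowMillis) {
        failureTimestamps.addLast(nowMillis);
        totalFailures++;
    }

    /**
     * Returns true if more than maxFailuresPerWindow failures happened
     * within the last windowMillis. A caller could then cancel pending
     * container requests and fail the job instead of requesting more.
     */
    boolean isLimitExceeded(long nowMillis) {
        // Drop failures that are older than the window.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() > windowMillis) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() > maxFailuresPerWindow;
    }

    /** Value that could back a container-failure count metric. */
    long getTotalFailures() {
        return totalFailures;
    }
}
```

In this sketch the resource manager would call recordFailure whenever a 
container fails to start and check isLimitExceeded before requesting a 
replacement, while getTotalFailures would feed the metric for on-call alerts.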

> Flink's JobCluster ResourceManager doesn't use maximum-failed-containers as 
> limit of resource acquirement
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10868
>                 URL: https://issues.apache.org/jira/browse/FLINK-10868
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Mesos, Deployment / YARN
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers 
> as a limit on resource acquisition. In the worst case, when newly started 
> containers consistently fail, YarnResourceManager will go into an infinite 
> resource acquisition process without failing the job. Together with 
> https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy 
> all resources of the YARN queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
