[ https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251539#comment-17251539 ]
Zhenqiu Huang commented on FLINK-10868:
---------------------------------------
Hi [~xintongsong],
I just noticed that you commented on the Jira ticket. Your summary of the
problem and solution is accurate. Without a failure rate limit, the worst case
we saw was a bad job that could not download its job jar from HDFS: the Flink
resource manager kept requesting more containers from YARN and ended up
blocking the whole queue. During that outage, it blocked another critical
pipeline from upgrading its job and submitting to the same queue. That is why,
in the current implementation, I chose to cancel all pending requests and kill
the job.
I agree that this could be a generic solution for both YARN and Kubernetes.
Besides leveraging FailureRater for cool-down time management, I would also
suggest adding count metrics for container failures, so that the on-call
engineer can handle the worst situations in time. What do you think? If we are
on the same page, I would like to update the PR accordingly. Thanks.
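To make the failure rate idea concrete, here is a minimal sketch, assuming a
sliding time window with a fixed failure threshold plus a running failure
count. The class name SlidingWindowFailureRater and all of its methods are
hypothetical, chosen for illustration only; they are not taken from the Flink
code base or from the PR.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a sliding-window failure rater; not Flink's API. */
class SlidingWindowFailureRater {
    private final int maxFailuresPerWindow;
    private final long windowMillis;
    // Timestamps of recent failures, oldest first.
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();
    // Running total of all failures; could be exposed as a count metric.
    private long totalFailures;

    SlidingWindowFailureRater(int maxFailuresPerWindow, long windowMillis) {
        this.maxFailuresPerWindow = maxFailuresPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Records one container failure observed at the given time. */
    void markFailure(long nowMillis) {
        failureTimestamps.addLast(nowMillis);
        totalFailures++;
    }

    /** Returns true if more than the allowed number of failures fall inside the window. */
    boolean exceedsFailureRate(long nowMillis) {
        // Drop failures that are at least one window old.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() >= windowMillis) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() > maxFailuresPerWindow;
    }

    long getTotalFailures() {
        return totalFailures;
    }
}
```

In such a design, the resource manager might call markFailure whenever a
container fails to start and check exceedsFailureRate before requesting a
replacement, either failing the job or pausing requests when the limit is hit.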
> Flink's JobCluster ResourceManager doesn't use maximum-failed-containers as
> limit of resource acquirement
> ---------------------------------------------------------------------------------------------------------
>
> Key: FLINK-10868
> URL: https://issues.apache.org/jira/browse/FLINK-10868
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Mesos, Deployment / YARN
> Affects Versions: 1.6.2, 1.7.0
> Reporter: Zhenqiu Huang
> Assignee: Zhenqiu Huang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers as
> a limit on resource acquisition. In the worst case, when newly started
> containers consistently fail, YarnResourceManager will go into an infinite
> resource acquisition loop without failing the job. Together with
> https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all
> resources of the YARN queue.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)