[
https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245067#comment-17245067
]
Xintong Song commented on FLINK-10868:
--------------------------------------
Hi [~hpeter],
I've gone through the previous discussions and the PR. Here's how I understand
the problem and the proposed changes. Please correct me if I'm wrong.
* Problem
** Flink may encounter failures when trying to start new task managers. For
some of these failures, an immediate retry may not help, e.g., when files on
HDFS are temporarily or permanently unavailable.
** In such cases, Flink will keep allocating new containers from Yarn and
releasing them due to failures in starting them.
*** The retrying should no longer be infinite. It should stop when the slot
requests time out, because Flink now always checks the pending requests
before re-requesting resources.
* Solution
** The proposed approach leverages a `FailureRater` that counts task manager
start failures within a certain time window to decide whether an immediate
retry is desired (see the sketch after this list).
** If a high container failure rate has been detected recently, Flink stops
allocating new containers by canceling all pending slot requests and rejecting
new requests for a while.
** It leaves the decision whether and when to re-request resources to the job
masters.
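To make sure we are on the same page, below is a rough sketch of how I
understand the failure-rate counting. The class and method names are made up
for illustration and are not the actual implementation from the PR.

```java
import java.time.Duration;
import java.util.ArrayDeque;

/**
 * Illustrative sketch: records task manager start failures and reports whether
 * the failure count within a sliding time window exceeds a threshold.
 */
public class StartFailureRater {

    private final int maxFailuresPerWindow;
    private final Duration window;
    private final ArrayDeque<Long> failureTimestamps = new ArrayDeque<>();

    public StartFailureRater(int maxFailuresPerWindow, Duration window) {
        this.maxFailuresPerWindow = maxFailuresPerWindow;
        this.window = window;
    }

    /** Called whenever starting a task manager fails. */
    public void recordFailure(long nowMillis) {
        failureTimestamps.addLast(nowMillis);
        prune(nowMillis);
    }

    /** True if the recent failure rate suggests pausing immediate retries. */
    public boolean exceedsThreshold(long nowMillis) {
        prune(nowMillis);
        return failureTimestamps.size() >= maxFailuresPerWindow;
    }

    /** Drop failures that fall outside the sliding window. */
    private void prune(long nowMillis) {
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() > window.toMillis()) {
            failureTimestamps.removeFirst();
        }
    }
}
```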
FYI, we have encountered and solved a similar problem with the Kubernetes API
server being temporarily unavailable. The two steps on Yarn, allocating and
starting a container, are combined into one step on Kubernetes: adding a new
pod. If the Kubernetes API server is temporarily unavailable (e.g., due to too
much load), Flink could retry immediately over and over again. This not only
keeps the JM CPU busy, but also adds more load to the Kubernetes API server.
We solved the problem with a different approach. Whenever a pod creation
failure happens, Flink stops adding new pods for a "cool down time", during
which attempts to add new pods are cached and re-triggered after the cool down.
In general, instead of canceling pending requests and rejecting new requests,
it simply takes a break before retrying. See
`KubernetesResourceManagerDriver#podCreationCoolDown` for details.
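For illustration, the cool-down behavior roughly works like the sketch below.
This is not the actual `KubernetesResourceManagerDriver` code, just a
simplified outline of the idea; the point is that nothing gets canceled or
rejected, requests are only delayed.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch of a pod-creation cool-down, not the real driver code. */
public class PodCreationCoolDown {

    private final Duration coolDown;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Queue<Runnable> pendingCreations = new ArrayDeque<>();
    private boolean coolingDown = false;

    public PodCreationCoolDown(Duration coolDown) {
        this.coolDown = coolDown;
    }

    /** Submit a pod creation attempt; it is cached if a cool-down is in progress. */
    public synchronized void createPod(Runnable creationAttempt) {
        if (coolingDown) {
            pendingCreations.add(creationAttempt);
        } else {
            creationAttempt.run();
        }
    }

    /** Called on a pod creation failure: start the cool-down if not already running. */
    public synchronized void onPodCreationFailure() {
        if (!coolingDown) {
            coolingDown = true;
            scheduler.schedule(this::endCoolDown, coolDown.toMillis(), TimeUnit.MILLISECONDS);
        }
    }

    /** Re-trigger the attempts cached during the cool-down. */
    private synchronized void endCoolDown() {
        coolingDown = false;
        while (!coolingDown && !pendingCreations.isEmpty()) {
            pendingCreations.poll().run();
        }
    }
}
```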
We have kept the solution within `KubernetesResourceManagerDriver` because we
thought it was specific to the native Kubernetes deployment. If Yarn/Mesos have
similar problems, it makes sense to move the approach to the common
`AbstractResourceManagerDriver` or `ActiveResourceManager`.
I have a feeling that these two approaches can be complementary.
* The idea of a `FailureRater` is more advanced than reacting to every single
failure. Tolerating occasional failures is significant, especially on Yarn,
where Flink talks to many Yarn NMs and the problem might only exist for some of
them.
* I'm not sure canceling slot requests and rejecting new requests are
necessary. Wouldn't it be good enough to stop re-allocating resources for a
while (see the sketch after this list)?
** It reduces the workload on the external systems (Kubernetes/Yarn/Mesos).
** It's less disruptive, without adding more complexity on the JobMasters.
** JobMasters are still able to decide whether they still want the resources
on slot request timeouts.
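To illustrate what I mean by "complementary", the two sketches above could
compose roughly like this (again, names are purely illustrative): the failure
rater decides when to pause, the cool-down decides how the pause behaves, and
pending/new slot requests are neither canceled nor rejected.

```java
import java.time.Duration;

/** Illustrative composition of the two sketches above, not actual Flink classes. */
public class PausingResourceAllocator {

    private final StartFailureRater failureRater =
            new StartFailureRater(10, Duration.ofMinutes(1));
    private final PodCreationCoolDown coolDown =
            new PodCreationCoolDown(Duration.ofSeconds(30));

    /** Request a new worker; it is only delayed (never rejected) during a cool-down. */
    public void requestNewWorker(Runnable startWorker) {
        coolDown.createPod(startWorker);
    }

    /** Record a start failure; pause re-allocation only if the recent rate is too high. */
    public void onWorkerStartFailure() {
        long now = System.currentTimeMillis();
        failureRater.recordFailure(now);
        if (failureRater.exceedsThreshold(now)) {
            coolDown.onPodCreationFailure();
        }
    }
}
```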
WDYT?
> Flink's JobCluster ResourceManager doesn't use maximum-failed-containers as
> limit of resource acquirement
> ---------------------------------------------------------------------------------------------------------
>
> Key: FLINK-10868
> URL: https://issues.apache.org/jira/browse/FLINK-10868
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Mesos, Deployment / YARN
> Affects Versions: 1.6.2, 1.7.0
> Reporter: Zhenqiu Huang
> Assignee: Zhenqiu Huang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers as
> a limit of resource acquirement. In the worst case, when newly started
> containers consistently fail, YarnResourceManager will go into an infinite
> resource acquirement process without failing the job. Together with
> https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all
> resources of the yarn queue.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)