[
https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245067#comment-17245067
]
Xintong Song commented on FLINK-10868:
--------------------------------------
Hi [~hpeter],
I've gone through the previous discussions and the PR. Here's how I understand
the problem and the proposed changes. Please correct me if I'm wrong.
* Problem
** Flink may encounter failures when trying to start new task managers. For
some of these failures, an immediate retry may not help, e.g., when files on
HDFS are temporarily or permanently unavailable.
** In such cases, Flink will keep allocating new containers from Yarn and
releasing them due to failures in starting them.
*** The retrying should no longer be infinite. It should stop when the slot
requests time out, because Flink now always checks the pending requests
before re-requesting resources.
* Solution
** The proposed approach leverages a `FailureRater` that counts task manager
start failures within a certain time window to decide whether an immediate
retry is desired (see the sketch after this list).
** If a high container failure rate has been detected recently, Flink stops
allocating new containers by canceling all pending slot requests and rejecting
new requests for a while.
** It leaves the decision whether and when to re-request resources to the job
masters.
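To make sure we are on the same page, below is a rough sketch of how I
understand the failure-rate counting. The class and method names are made up
for illustration and are not the actual implementation from the PR.

```java
import java.time.Duration;
import java.util.ArrayDeque;

/**
 * Illustrative sketch: records task manager start failures and reports whether
 * the failure count within a sliding time window exceeds a threshold.
 */
public class StartFailureRater {

    private final int maxFailuresPerWindow;
    private final Duration window;
    private final ArrayDeque<Long> failureTimestamps = new ArrayDeque<>();

    public StartFailureRater(int maxFailuresPerWindow, Duration window) {
        this.maxFailuresPerWindow = maxFailuresPerWindow;
        this.window = window;
    }

    /** Called whenever starting a task manager fails. */
    public void recordFailure(long nowMillis) {
        failureTimestamps.addLast(nowMillis);
        prune(nowMillis);
    }

    /** True if the recent failure rate suggests pausing immediate retries. */
    public boolean exceedsThreshold(long nowMillis) {
        prune(nowMillis);
        return failureTimestamps.size() >= maxFailuresPerWindow;
    }

    /** Drop failures that fall outside the sliding window. */
    private void prune(long nowMillis) {
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() > window.toMillis()) {
            failureTimestamps.removeFirst();
        }
    }
}
```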
FYI, we have encountered and solved a similar problem with the Kubernetes API
server being temporarily unavailable. The two steps on Yarn, allocating and
starting a container, are combined into one step on Kubernetes: adding a new
pod. If the Kubernetes API server is temporarily unavailable (e.g., due to too
much load), Flink could retry immediately over and over again. This not only
keeps the JM CPU busy, but also adds more load to the Kubernetes API server.
We solved the problem with a different approach. Whenever a pod creation
failure happens, Flink stops adding new pods for a "cool down time", during
which attempts to add new pods are cached and re-triggered after the cool down.
In general, instead of canceling pending requests and rejecting new requests,
it simply takes a break before retrying. See
`KubernetesResourceManagerDriver#podCreationCoolDown` for details.
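For illustration, the cool-down behavior roughly works like the sketch below.
This is not the actual `KubernetesResourceManagerDriver` code, just a
simplified outline of the idea; the point is that nothing gets canceled or
rejected, requests are only delayed.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch of a pod-creation cool-down, not the real driver code. */
public class PodCreationCoolDown {

    private final Duration coolDown;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Queue<Runnable> pendingCreations = new ArrayDeque<>();
    private boolean coolingDown = false;

    public PodCreationCoolDown(Duration coolDown) {
        this.coolDown = coolDown;
    }

    /** Submit a pod creation attempt; it is cached if a cool-down is in progress. */
    public synchronized void createPod(Runnable creationAttempt) {
        if (coolingDown) {
            pendingCreations.add(creationAttempt);
        } else {
            creationAttempt.run();
        }
    }

    /** Called on a pod creation failure: start the cool-down if not already running. */
    public synchronized void onPodCreationFailure() {
        if (!coolingDown) {
            coolingDown = true;
            scheduler.schedule(this::endCoolDown, coolDown.toMillis(), TimeUnit.MILLISECONDS);
        }
    }

    /** Re-trigger the attempts cached during the cool-down. */
    private synchronized void endCoolDown() {
        coolingDown = false;
        while (!coolingDown && !pendingCreations.isEmpty()) {
            pendingCreations.poll().run();
        }
    }
}
```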
We have kept the solution within `KubernetesResourceManagerDriver` because we
thought it was specific to the native Kubernetes deployment. If Yarn/Mesos have
similar problems, it makes sense to move the approach to the common
`AbstractResourceManagerDriver` or `ActiveResourceManager`.
I have a feeling that these two approaches can be complementary.
* The idea of a `FailureRater` is more advanced than reacting to every single
failure. Tolerating occasional failures is significant, especially on Yarn,
where Flink talks to many Yarn NMs and the problem might only exist for some of
them.
* I'm not sure canceling slot requests and rejecting new requests are
necessary. Wouldn't it be good enough to stop re-allocating resources for a
while (see the sketch after this list)?
** It reduces the workload on the external systems (Kubernetes/Yarn/Mesos).
** It's less disruptive, without adding more complexity on the JobMasters.
** JobMasters are still able to decide whether they still want the resources
on slot request timeouts.
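To illustrate what I mean by "complementary", the two sketches above could
compose roughly like this (again, names are purely illustrative): the failure
rater decides when to pause, the cool-down decides how the pause behaves, and
pending/new slot requests are neither canceled nor rejected.

```java
import java.time.Duration;

/** Illustrative composition of the two sketches above, not actual Flink classes. */
public class PausingResourceAllocator {

    private final StartFailureRater failureRater =
            new StartFailureRater(10, Duration.ofMinutes(1));
    private final PodCreationCoolDown coolDown =
            new PodCreationCoolDown(Duration.ofSeconds(30));

    /** Request a new worker; it is only delayed (never rejected) during a cool-down. */
    public void requestNewWorker(Runnable startWorker) {
        coolDown.createPod(startWorker);
    }

    /** Record a start failure; pause re-allocation only if the recent rate is too high. */
    public void onWorkerStartFailure() {
        long now = System.currentTimeMillis();
        failureRater.recordFailure(now);
        if (failureRater.exceedsThreshold(now)) {
            coolDown.onPodCreationFailure();
        }
    }
}
```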
WDYT?
> Flink's JobCluster ResourceManager doesn't use maximum-failed-containers as
> limit of resource acquirement
> ---------------------------------------------------------------------------------------------------------
>
> Key: FLINK-10868
> URL: https://issues.apache.org/jira/browse/FLINK-10868
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Mesos, Deployment / YARN
> Affects Versions: 1.6.2, 1.7.0
> Reporter: Zhenqiu Huang
> Assignee: Zhenqiu Huang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers as
> a limit of resource acquirement. In the worst case, when newly started
> containers consistently fail, YarnResourceManager will go into an infinite
> resource acquirement process without failing the job. Together with
> https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all
> resources of the yarn queue.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)