[
https://issues.apache.org/jira/browse/FLINK-18226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133066#comment-17133066
]
Xintong Song commented on FLINK-18226:
--------------------------------------
Some updates on this ticket.
I ran into some YARN test failures. It turns out that we cannot simply move the
decrement of the pending worker count from on-worker-allocated to
on-worker-registered. The problem is that {{YarnResourceManager}} relies on
this counter for deciding which worker to start.
A YARN container resource {{c1}} (e.g., <4GB>) might correspond to multiple
worker resource specs {{w1}} and {{w2}} (e.g., <1GB heap, 3GB off-heap> and
<3GB heap, 1GB off-heap>). Say the number of pending workers is
<w1 : 1, w2 : 1>; that means the number of pending containers is <c1 : 2>.
When a container of resource {{c1}} is allocated, YarnRM chooses one of {{w1}}
and {{w2}} and starts a TM accordingly. Let's say it picks {{w1}}. Then the
pending workers become <w1 : 0, w2 : 1> and the pending containers become
<c1 : 1>. This way, when another container of {{c1}} is allocated, YarnRM
knows it should start a TM with {{w2}} rather than {{w1}}.
If we instead decrease the counter when the worker is registered rather than
when it is allocated, then YarnRM won't be able to tell which worker resource
spec the TM should be started with when the second container is allocated
before the first one has registered.
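
To make the bookkeeping concrete, here is a minimal sketch of why the per-spec
pending counter must be decremented at allocation time. This is only an
illustration with made-up class and method names (the real logic lives in
{{YarnResourceManager}} / {{ActiveResourceManager}}), not the actual code:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the per-spec pending-worker bookkeeping in YarnRM. */
public class PendingWorkerBookkeeping {

    // Pending worker specs, grouped by the YARN container resource they map to.
    // Distinct specs (e.g. <1GB heap, 3GB off-heap> and <3GB heap, 1GB off-heap>)
    // may normalize to the same container resource (e.g. <4GB>).
    private final Map<String, Deque<String>> pendingSpecsByContainerResource =
            new HashMap<>();

    /** Called when a new worker is requested: remember which spec it is for. */
    public void onWorkerRequested(String containerResource, String workerSpec) {
        pendingSpecsByContainerResource
                .computeIfAbsent(containerResource, r -> new ArrayDeque<>())
                .add(workerSpec);
    }

    /**
     * Called when YARN allocates a container of the given resource. The pending
     * spec must be taken out here; otherwise a second allocation of the same
     * container resource could not be matched to the remaining spec.
     */
    public Optional<String> onContainerAllocated(String containerResource) {
        Deque<String> specs = pendingSpecsByContainerResource.get(containerResource);
        if (specs == null || specs.isEmpty()) {
            // Superfluous container; it should be returned to YARN.
            return Optional.empty();
        }
        // Pick one pending spec and start a TM with it.
        return Optional.of(specs.poll());
    }
}
{code}

With pending workers <w1 : 1, w2 : 1>, two allocations of {{c1}} hand out
{{w1}} and then {{w2}}; if the first poll were deferred until registration,
the second allocation could not tell that {{w1}} is already taken.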
> ResourceManager requests unnecessary new workers if previous workers are
> allocated but not registered.
> ------------------------------------------------------------------------------------------------------
>
> Key: FLINK-18226
> URL: https://issues.apache.org/jira/browse/FLINK-18226
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes, Deployment / YARN, Runtime /
> Coordination
> Affects Versions: 1.11.0
> Reporter: Xintong Song
> Assignee: Xintong Song
> Priority: Blocker
> Fix For: 1.11.0
>
>
> h2. Problem
> Currently, on Kubernetes & Yarn deployments, the ResourceManager compares
> *pending workers requested from Kubernetes/Yarn* against *pending workers
> required by SlotManager* to decide whether new workers should be requested
> in case of a worker failure.
> * {{KubernetesResourceManager#requestKubernetesPodIfRequired}}
> * {{YarnResourceManager#requestYarnContainerIfRequired}}
> *Pending workers requested from Kubernetes/Yarn* is decreased when the worker
> is allocated, *before the worker is actually started and registered*.
> * Decreased in {{ActiveResourceManager#notifyNewWorkerAllocated}}, which is
> called in
> * {{KubernetesResourceManager#onAdded}}
> * {{YarnResourceManager#onContainersOfResourceAllocated}}
> On the other hand, *pending workers required by SlotManager* is derived from
> the number of pending slots inside SlotManager, which is decreased *when the
> new workers/slots are registered*.
> * {{SlotManagerImpl#registerSlot}}
> Therefore, if a worker {{w1}} fails after another worker {{w2}} is
> allocated but before {{w2}} is registered, the ResourceManager will request
> an unnecessary new worker for {{w2}}.
> h2. Impact
> Normally, the extra worker should be released soon after it is allocated. But
> in cases where the Kubernetes/Yarn cluster does not have enough resources, it
> might create more and more pending pods/containers.
> It's even more severe for Kubernetes, because
> {{KubernetesResourceManager#onAdded}} only indicates that the pod spec has
> been successfully added to the cluster; the pod may not actually have been
> allocated due to lack of resources. Imagine there are {{N}} pending pods; a
> failure of a running pod then means requesting another {{N}} new pods.
> In a session cluster, such pending pods could take a long time to be
> cleared, even after all jobs in the session cluster have terminated.
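
The gist of the quoted Problem section is two counters being decremented at
different points of a worker's lifecycle. A minimal, hypothetical sketch of the
comparison (field and method names are made up; the real logic is spread across
{{ActiveResourceManager}}, the Kubernetes/Yarn RMs, and {{SlotManagerImpl}}):

{code:java}
/**
 * Hypothetical sketch of the pending-worker comparison described above.
 * Names are made up for illustration; this is not the actual Flink code.
 */
public class PendingWorkerComparison {

    // Decremented as soon as a worker is *allocated*
    // (cf. ActiveResourceManager#notifyNewWorkerAllocated).
    private int pendingWorkersRequestedFromCluster;

    // Derived from SlotManager's pending slots, which only shrink once the new
    // worker's slots are *registered* (cf. SlotManagerImpl#registerSlot).
    private int pendingWorkersRequiredBySlotManager;

    /** Invoked on a worker failure, to decide how many new workers to request. */
    public int workersToRequestOnWorkerFailure() {
        // A worker that is allocated but not yet registered is still counted as
        // "required" but no longer as "requested", so this difference
        // over-estimates the number of workers that are actually missing.
        return Math.max(
                0,
                pendingWorkersRequiredBySlotManager - pendingWorkersRequestedFromCluster);
    }
}
{code}

With {{N}} workers in the allocated-but-unregistered state (e.g. {{N}} pending
Kubernetes pods), the difference is already {{N}}, so a single failure of a
running pod triggers another round of roughly {{N}} requests, as described in
the Impact section above.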
--
This message was sent by Atlassian Jira
(v8.3.4#803005)