Xintong Song created FLINK-18229:
------------------------------------
Summary: Pending worker requests should be properly cleared
Key: FLINK-18229
URL: https://issues.apache.org/jira/browse/FLINK-18229
Project: Flink
Issue Type: Improvement
Components: Deployment / Kubernetes, Deployment / YARN, Runtime /
Coordination
Affects Versions: 1.10.1, 1.9.3, 1.11.0
Reporter: Xintong Song
Currently, if Kubernetes/Yarn does not have enough resources to fulfill Flink's
resource requirement, there will be pending pod/container requests on
Kubernetes/Yarn. These pending resource requirements are never cleared until
either fulfilled or the Flink cluster is shutdown.
However, sometimes Flink no longer needs the pending resources. E.g., the slot
request is then fulfilled by another slots that become available, or the job
failed due to slot request timeout (in a session cluster). In such cases, Flink
does not remove the resource request until the resource is allocated, then it
discovers that it no longer needs the allocated resource and release them. This
would affect the underlying Kubernetes/Yarn cluster, especially when the
cluster is under heavy workload.
It would be good for Flink to cancel pod/container requests as earlier as
possible if it can discover that some of the pending workers are no longer
needed.
There are several approaches potentially achieve this.
# We can always check whether there's a pending worker that can be canceled
when a \{{PendingTaskManagerSlot}} is unassigned.
# We can have a separate timeout for requesting new worker. If the resource
cannot be allocated within the given time since requested, we should cancel
that resource request and claim a resource allocation failure.
# We can share the same timeout for starting new worker (proposed in
FLINK-13554). This is similar to 2), but it requires the worker to be
registered, rather than allocated, before timeout.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)