[jira] [Created] (FLINK-18229) Pending worker requests should be properly cleared

Xintong Song (Jira) Tue, 09 Jun 2020 22:16:32 -0700

Xintong Song created FLINK-18229:
------------------------------------

             Summary: Pending worker requests should be properly cleared
                 Key: FLINK-18229
                 URL: https://issues.apache.org/jira/browse/FLINK-18229
             Project: Flink
          Issue Type: Improvement
          Components: Deployment / Kubernetes, Deployment / YARN, Runtime / Coordination
    Affects Versions: 1.10.1, 1.9.3, 1.11.0
            Reporter: Xintong Song



Currently, if Kubernetes/Yarn does not have enough resources to fulfill Flink's 
resource requirement, there will be pending pod/container requests on 
Kubernetes/Yarn. These pending resource requests are never cleared until 
they are either fulfilled or the Flink cluster is shut down.

However, sometimes Flink no longer needs the pending resources. E.g., the slot 
request is fulfilled in the meantime by other slots that become available, or 
the job fails due to a slot request timeout (in a session cluster). In such 
cases, Flink does not remove the resource request until the resource is 
allocated; only then does it discover that it no longer needs the allocated 
resource and release it. This can negatively affect the underlying 
Kubernetes/Yarn cluster, especially when the cluster is under heavy workload.

It would be good for Flink to cancel pod/container requests as early as 
possible if it can discover that some of the pending workers are no longer 
needed.

There are several approaches that could potentially achieve this.
 # We can always check whether there's a pending worker that can be canceled 
when a {{PendingTaskManagerSlot}} is unassigned.
 # We can have a separate timeout for requesting a new worker. If the resource 
cannot be allocated within the given time after it was requested, we should 
cancel that resource request and report a resource allocation failure.
 # We can share the same timeout as for starting a new worker (proposed in 
FLINK-13554). This is similar to 2), but it requires the worker to be 
registered, rather than allocated, before the timeout.
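
To make approaches 1) and 2) more concrete, here is a minimal sketch of the 
bookkeeping they imply. It is an illustration only, not existing Flink code: 
the class PendingWorkerTracker, its methods, and the slotsPerWorker and 
requestTimeoutMs parameters are all hypothetical, and the actual cancellation 
call against Kubernetes/Yarn is left out. It also assumes that requests are 
recorded with non-decreasing timestamps.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical tracker for outstanding pod/container requests; not a Flink class. */
final class PendingWorkerTracker {
    /** Request timestamps (ms) of workers not yet allocated, oldest first. */
    private final Deque<Long> pendingRequestTimes = new ArrayDeque<>();
    private final int slotsPerWorker;      // assumed fixed number of slots per worker
    private final long requestTimeoutMs;   // assumed timeout for approach 2)

    PendingWorkerTracker(int slotsPerWorker, long requestTimeoutMs) {
        this.slotsPerWorker = slotsPerWorker;
        this.requestTimeoutMs = requestTimeoutMs;
    }

    /** Records a new worker request; nowMs must not be earlier than prior calls. */
    void requestWorker(long nowMs) {
        pendingRequestTimes.addLast(nowMs);
    }

    int pendingWorkers() {
        return pendingRequestTimes.size();
    }

    /**
     * Approach 1): call when a pending slot is unassigned. Drops the newest
     * requests until the remaining ones just cover the slots still needed.
     * Returns how many requests were dropped.
     */
    int cancelSurplus(int pendingSlotsStillNeeded) {
        // Round up: a partially used worker is still required.
        int workersNeeded = (pendingSlotsStillNeeded + slotsPerWorker - 1) / slotsPerWorker;
        int cancelled = 0;
        while (pendingRequestTimes.size() > workersNeeded) {
            pendingRequestTimes.removeLast();
            cancelled++;
        }
        return cancelled;
    }

    /**
     * Approach 2): drops every request that has been pending for at least
     * requestTimeoutMs. Returns how many requests were dropped.
     */
    int cancelTimedOut(long nowMs) {
        int cancelled = 0;
        while (!pendingRequestTimes.isEmpty()
                && nowMs - pendingRequestTimes.peekFirst() >= requestTimeoutMs) {
            pendingRequestTimes.removeFirst();
            cancelled++;
        }
        return cancelled;
    }
}
```

In this sketch, approach 1) cancels the most recent requests first, on the 
assumption that older requests are closer to being fulfilled; a real 
implementation might choose differently. Approach 3) would differ from 2) 
only in when a request leaves the queue: on worker registration rather than 
on allocation.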



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
