[
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628727#comment-17628727
]
Weihua Hu commented on FLINK-18229:
-----------------------------------
Hi [~xtsong], I have updated the design doc. To make the changes clearer, I
created two new issues:
1. I will split ResourceActions and make it nullable in slotManager in
[FLINK-29870|https://issues.apache.org/jira/browse/FLINK-29870].
2. Then I will make ResourceActions declarative in
[FLINK-29869|https://issues.apache.org/jira/browse/FLINK-29869].
3. Once these two are finished, canceling pending worker requests will be
supported.
> Pending worker requests should be properly cleared
> --------------------------------------------------
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
> Issue Type: Sub-task
> Components: Deployment / Kubernetes, Deployment / YARN, Runtime /
> Coordination
> Affects Versions: 1.9.3, 1.10.1, 1.11.0
> Reporter: Xintong Song
> Assignee: Weihua Hu
> Priority: Major
> Fix For: 1.17.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill
> Flink's resource requirement, there will be pending pod/container requests on
> Kubernetes/Yarn. These pending requests are never cleared until they are
> either fulfilled or the Flink cluster is shut down.
> However, sometimes Flink no longer needs the pending resources. E.g., the
> slot request is then fulfilled by other slots that become available, or the
> job failed due to slot request timeout (in a session cluster). In such cases,
> Flink does not remove the resource request until the resource is allocated;
> only then does it discover that it no longer needs the allocated resource and
> release it. This would affect the underlying Kubernetes/Yarn cluster,
> especially when the cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as early as
> possible if it can discover that some of the pending workers are no longer
> needed.
> There are several approaches that could potentially achieve this.
> # We can always check whether there's a pending worker that can be canceled
> when a {{PendingTaskManagerSlot}} is unassigned.
> # We can have a separate timeout for requesting a new worker. If the resource
> cannot be allocated within the given time since it was requested, we should
> cancel that resource request and claim a resource allocation failure.
> # We can share the same timeout for starting a new worker (proposed in
> FLINK-13554). This is similar to 2), but it requires the worker to be
> registered, rather than allocated, before the timeout.
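
The first two approaches in the description above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Flink code: the class PendingWorkerRequests and its methods onRequested, onAllocated, cancelSurplus, and cancelExpired are hypothetical names, and the choice to cancel the newest requests first is an assumption. A real implementation would also have to forward each cancellation to Kubernetes/Yarn.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical bookkeeping for pending worker requests; not part of Flink's API. */
final class PendingWorkerRequests {
    private final Duration requestTimeout;
    // Request id -> timestamp (millis) at which the request was sent, in insertion order.
    private final Map<String, Long> requestedAtMillis = new LinkedHashMap<>();

    PendingWorkerRequests(Duration requestTimeout) {
        this.requestTimeout = requestTimeout;
    }

    /** Record that a worker was requested from the external resource manager. */
    void onRequested(String requestId, long nowMillis) {
        requestedAtMillis.put(requestId, nowMillis);
    }

    /** The request was fulfilled, so it is no longer pending. */
    void onAllocated(String requestId) {
        requestedAtMillis.remove(requestId);
    }

    /**
     * Approach 1: when pending slots are unassigned, cancel requests beyond the
     * number of workers still needed. Newest requests are cancelled first (assumption).
     */
    List<String> cancelSurplus(int stillNeeded) {
        List<String> ids = new ArrayList<>(requestedAtMillis.keySet());
        List<String> cancelled = new ArrayList<>();
        for (int i = ids.size() - 1; i >= 0 && requestedAtMillis.size() > stillNeeded; i--) {
            requestedAtMillis.remove(ids.get(i));
            cancelled.add(ids.get(i));
        }
        return cancelled;
    }

    /** Approach 2: cancel every request that has been pending for at least the timeout. */
    List<String> cancelExpired(long nowMillis) {
        List<String> cancelled = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = requestedAtMillis.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= requestTimeout.toMillis()) {
                cancelled.add(entry.getKey());
                it.remove();
            }
        }
        return cancelled;
    }

    int pendingCount() {
        return requestedAtMillis.size();
    }
}
```

In this sketch the caller would invoke cancelSurplus whenever the number of required workers drops, and cancelExpired from a periodic check. Approach 3 would differ only in that a request stays pending until the worker registers, rather than until it is allocated.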