huwh opened a new pull request, #21565: URL: https://github.com/apache/flink/pull/21565
## What is the purpose of the change

Currently, if Kubernetes/YARN does not have enough resources to fulfill Flink's resource requirements, pod/container requests stay pending on Kubernetes/YARN. These pending requests are never cleared until they are either fulfilled or the Flink cluster is shut down.

However, sometimes Flink no longer needs the pending resources. E.g., the slot request is fulfilled by other slots that become available, or the job fails due to a slot request timeout (in a session cluster). In such cases, Flink does not remove the resource request. It waits until the resource is allocated, then discovers that it no longer needs the allocated resource and releases it. This affects the underlying Kubernetes/YARN cluster, especially when the cluster is under heavy workload.

It would be good for Flink to cancel pod/container requests as early as possible once it can determine that some of the pending workers are no longer needed.

## Brief change log

- *SlotManager removes unused PendingTaskManagerSlot/PendingTaskManager and then declares the needed resources to ResourceAllocator*
- *ActiveResourceManager cancels pending workers by completing the requestWorkerFuture exceptionally with a RequestCancelledException*
- *(YARN/Kubernetes)ResourceManagerDriver stops/releases the pending workers when it receives the RequestCancelledException*

## Verifying this change

This change added tests and can be verified as follows:

- *Added unit tests for (YARN/Kubernetes)ResourceManagerDriver releasing pending workers*
- *Added unit tests for ActiveResourceManager handling declareResourceNeeded*
- *Manually verified the change with 2 registered TaskManagers, 3 pending TaskManagers, and a streaming program with slot.request.timeout set to 10000: the job fails after the slot request times out, and the pending TaskManagers are cancelled. Verified on both YARN and Kubernetes.*

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: SlotManager, Kubernetes/YARN
- The S3 file system connector: (no)

## Documentation

- Does this pull request introduce a new feature? (yes)
- If yes, how is the feature documented? https://docs.google.com/document/d/1lcmf3MKmcmf9tsPc1whaZHMYKurGoGtqOXEep9ngP2k/edit
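
For illustration, the cancellation flow described in the change log could look roughly like the plain-Java sketch below. It is not the code in this PR. `SketchDriver`, `SketchResourceManager`, `requestResource`, and the simplified `declareResourceNeeded(int)` signature are hypothetical stand-ins, and `RequestCancelledException` is redefined locally so the example compiles without Flink on the classpath. The only mechanism taken from the description is that a pending worker is cancelled by completing its request future exceptionally, and that the driver reacts to that exception by releasing the pending request.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class PendingWorkerCancellationSketch {

    /** Local stand-in for the exception named in the change log (assumption, not Flink's class). */
    static class RequestCancelledException extends RuntimeException {
        RequestCancelledException(String message) {
            super(message);
        }
    }

    /** Hypothetical driver that records which pending requests it withdrew. */
    static class SketchDriver {
        final List<String> released = new ArrayList<>();

        CompletableFuture<String> requestResource(String requestId) {
            CompletableFuture<String> requestWorkerFuture = new CompletableFuture<>();
            // A real driver would withdraw the pod/container request here;
            // the sketch only records the request id.
            requestWorkerFuture.whenComplete((worker, error) -> {
                if (error instanceof RequestCancelledException) {
                    released.add(requestId);
                }
            });
            return requestWorkerFuture;
        }
    }

    /** Hypothetical resource manager that keeps one future per pending worker. */
    static class SketchResourceManager {
        private final SketchDriver driver;
        private final Deque<CompletableFuture<String>> pendingRequests = new ArrayDeque<>();
        private int nextRequestId = 1;

        SketchResourceManager(SketchDriver driver) {
            this.driver = driver;
        }

        /** Simplified signature: only the number of pending workers still needed. */
        void declareResourceNeeded(int pendingWorkersNeeded) {
            // Request more workers if fewer are pending than needed.
            while (pendingRequests.size() < pendingWorkersNeeded) {
                pendingRequests.addLast(driver.requestResource("request-" + nextRequestId++));
            }
            // Cancel the newest pending requests that are no longer needed.
            while (pendingRequests.size() > pendingWorkersNeeded) {
                pendingRequests.removeLast().completeExceptionally(
                        new RequestCancelledException("worker no longer needed"));
            }
        }

        int pendingCount() {
            return pendingRequests.size();
        }
    }

    public static void main(String[] args) {
        SketchDriver driver = new SketchDriver();
        SketchResourceManager resourceManager = new SketchResourceManager(driver);
        resourceManager.declareResourceNeeded(3); // three pending workers
        resourceManager.declareResourceNeeded(0); // e.g. the job failed, nothing is needed
        System.out.println("released: " + driver.released);
    }
}
```

The `main` method loosely mirrors the manual verification scenario: three workers are pending, the requirement drops to zero, and all three requests are recorded as released without waiting for the external cluster to allocate them first.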