huwh opened a new pull request, #21565:
URL: https://github.com/apache/flink/pull/21565

   
   ## What is the purpose of the change
   
   Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
Flink's resource requirement, there will be pending pod/container requests on 
Kubernetes/Yarn. These pending resource requirements are never cleared until 
either fulfilled or the Flink cluster is shutdown.
   However, sometimes Flink no longer needs the pending resources. E.g., the 
slot request is then fulfilled by another slots that become available, or the 
job failed due to slot request timeout (in a session cluster). In such cases, 
Flink does not remove the resource request until the resource is allocated, 
then it discovers that it no longer needs the allocated resource and release 
them. This would affect the underlying Kubernetes/Yarn cluster, especially when 
the cluster is under heavy workload.
   It would be good for Flink to cancel pod/container requests as earlier as 
possible if it can discover that some of the pending workers are no longer 
needed.
   
   ## Brief change log
     - *SlotManager will remove unused 
PendingTaskManagerSlot/PendingTaskManager and then declare the resources needed 
to ResourceAllocator*
     - *ActiveResourceManager will cancel pending workers by complete 
requestWorkerFuture by RequestCancelledException*
     - *(YARN/Kubernetes)ResourceManagerDriver will stop/release the pending 
workers when get the RequestCancelledException*
   
   
   ## Verifying this change
   This change added tests and can be verified as follows:
   
   *(example:)*
     - *Added unit test for (YARN/Kubernetes)ResourceManagerDriver releasing 
pending worker*
     - *Added unit test for ActiveResouceManager dealing with 
declareResourceNeeded*
     - *Manually verified the change by running 2 registered TaskManagers, 3 
pending TaskManagers, a streaming program with set slot.request.timeout to 
10000, and the job will failed after slot request timeout, and the pending 
TaskManagers will be cancelled. Verified both in YARN/Kubernetes.*
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (no)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
     - The serializers: (no)
     - The runtime per-record code paths (performance sensitive): (no)
     - Anything that affects deployment or recovery: SlotManager,Kubernetes/Yarn
     - The S3 file system connector: (no)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes)
     - If yes, how is the feature documented? 
       - 
https://docs.google.com/document/d/1lcmf3MKmcmf9tsPc1whaZHMYKurGoGtqOXEep9ngP2k/edit
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to