Yuan Huang  created FLINK-33880:
-----------------------------------

             Summary: Introducing Retry Mechanism for Listing TaskManager Pods 
to Prevent API Server Connection Failures
                 Key: FLINK-33880
                 URL: https://issues.apache.org/jira/browse/FLINK-33880
             Project: Flink
          Issue Type: Improvement
          Components: Deployment / Kubernetes
    Affects Versions: 1.17.2
            Reporter: Yuan Huang 
         Attachments: image-2023-12-19-18-41-41-308.png, 
image-2023-12-19-18-44-13-623.png

In Kubernetes mode, when the JobManager restarts, it connects to the API 
server to list all TaskManager Pods so that it can recover the previous 
TaskManagers.

In a large Kubernetes cluster with potentially thousands of concurrently 
running jobs, all JobManagers may restart at the same time (e.g., during 
disaster recovery) and then connect to the API server at once. This burst of 
requests can push the API server to its maximum capacity, causing it to refuse 
some JobManager requests. The affected JobManagers may then fail and attempt 
to reconnect to the API server.

!image-2023-12-19-18-44-13-623.png|width=505,height=206!

To improve this, we propose adding a retry mechanism. When a connection 
attempt to the API server fails, Flink would wait for a period before trying 
again, which reduces the risk of overwhelming the server and improves the 
overall resilience of the system.
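
As a rough illustration of the waiting period described above, here is a 
minimal sketch using capped exponential backoff with random jitter. The class 
PodListRetry, its methods, and all parameter values are hypothetical; none of 
them are existing Flink or Kubernetes client APIs. The jitter is an assumption 
that goes beyond the proposal: it is added so that JobManagers restarted 
together do not all retry at the same instant.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper for illustration; not part of Flink. */
final class PodListRetry {

    /** Abstraction over Thread.sleep so the waiting can be controlled by the caller. */
    public interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    private PodListRetry() {}

    /**
     * Upper bound of the wait after failed attempt number {@code attempt} (1-based):
     * initialMillis doubled (attempt - 1) times, capped at maxMillis.
     */
    public static long backoffCapMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        for (int i = 1; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /**
     * Runs {@code action}, retrying on failure up to {@code maxAttempts} times in total.
     * Between attempts it waits a random time in [0, backoffCapMillis(...)].
     */
    public static <T> T callWithRetry(
            Callable<T> action,
            int maxAttempts,
            long initialMillis,
            long maxMillis,
            Sleeper sleeper) throws Exception {
        if (maxAttempts < 1 || initialMillis < 0 || maxMillis < 0) {
            throw new IllegalArgumentException("invalid retry settings");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry when the calling thread was interrupted.
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts, rethrow below
                }
                long cap = backoffCapMillis(attempt, initialMillis, maxMillis);
                // Random wait in [0, cap] spreads out retries from many JobManagers.
                long wait = ThreadLocalRandom.current().nextLong(cap + 1);
                sleeper.sleep(wait);
            }
        }
        throw last;
    }
}
```

A caller would pass the pod-listing call as the action, for example 
callWithRetry(() -> listTaskManagerPods(), 5, 500L, 30_000L, Thread::sleep), 
where listTaskManagerPods is a placeholder for whatever call lists the Pods 
and the numbers are arbitrary example values.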



--
This message was sent by Atlassian Jira
(v8.20.10#820010)