[ https://issues.apache.org/jira/browse/FLINK-24713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436628#comment-17436628 ]
Aitozi commented on FLINK-24713:
--------------------------------

I propose adding an interface to the ResourceManager that returns a {{recoveryFuture}}, which completes once all the old TaskManagers have registered or a specified registration timeout has elapsed. After that, the SlotManager can connect to the ResourceManager and then call startNewWorkers.

{code:java}
@Override
public CompletableFuture<Acknowledge> getRecoveryFuture() {
    return recoveryFuture;
}
{code}

> Postpone resourceManager serving after the recovery phase has finished
> ----------------------------------------------------------------------
>
>                 Key: FLINK-24713
>                 URL: https://issues.apache.org/jira/browse/FLINK-24713
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.1
>            Reporter: Aitozi
>            Priority: Major
>
> When the ResourceManager starts, the JobManager connects to it, which
> means the ResourceManager begins trying to serve resource requests
> from the SlotManager.
> If the ResourceManager fails over, it tries to recover the pods /
> containers from the previous attempt, but new resource requirements may
> arrive before the old TaskManagers register with the SlotManager.
> In this case, twice as many TaskManagers as required may be requested when
> the JobManager fails over. We may need a mechanism to postpone ResourceManager
> serving until the recovery phase has finished.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
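
For illustration only, here is a minimal sketch of how the {{recoveryFuture}} proposed in the comment above might be completed. The class {{RecoveryTracker}}, its constructor arguments, and {{onWorkerRegistered}} are hypothetical names and not part of Flink. The sketch uses {{CompletableFuture<Void>}} instead of Flink's {{Acknowledge}} so that it is self-contained, and it assumes Java 9+ for {{completeOnTimeout}}.

```java
import java.time.Duration;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not a Flink class): tracks re-registration of recovered workers. */
class RecoveryTracker {
    // Workers recovered from the previous attempt that have not re-registered yet.
    private final Set<String> pendingWorkers = ConcurrentHashMap.newKeySet();
    private final CompletableFuture<Void> recoveryFuture = new CompletableFuture<>();

    RecoveryTracker(Set<String> recoveredWorkerIds, Duration registrationTimeout) {
        pendingWorkers.addAll(recoveredWorkerIds);
        if (pendingWorkers.isEmpty()) {
            // Nothing to wait for, so the recovery phase is over immediately.
            recoveryFuture.complete(null);
        } else {
            // Assumption: Java 9+. Ends the recovery phase even if some workers never return.
            recoveryFuture.completeOnTimeout(
                    null, registrationTimeout.toMillis(), TimeUnit.MILLISECONDS);
        }
    }

    /** Called whenever a worker registers; completes the future once none are pending. */
    void onWorkerRegistered(String workerId) {
        if (pendingWorkers.remove(workerId) && pendingWorkers.isEmpty()) {
            recoveryFuture.complete(null);
        }
    }

    CompletableFuture<Void> getRecoveryFuture() {
        return recoveryFuture;
    }
}
```

Under these assumptions, a caller could defer requesting new workers with something like {{tracker.getRecoveryFuture().thenRun(...)}}, so that new resource requirements are only served after the old TaskManagers are back or the timeout has fired.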