[ 
https://issues.apache.org/jira/browse/FLINK-24713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438459#comment-17438459
 ] 

Yangze Guo commented on FLINK-24713:
------------------------------------

[~aitozi] I second Till's proposal. A configurable interval is more flexible 
than simply waiting for all old TMs. I would suggest a conservative default 
value so that the job's failover does not regress much. In our internal 
environment, we found that most of the old TMs re-register within 1s, so that 
value might be a good first step.
Alternatively, we can disable this feature by default; users who suffer from 
this issue can then configure it according to their own environment.
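
To make the proposal concrete, below is a minimal sketch of such a gate. It is 
an illustration under stated assumptions, not Flink code: the class name 
RecoveryGate, its methods, and the injected clock are all hypothetical. The 
gate allows new worker requests once every recovered worker has registered 
again or once the configured grace period has elapsed, whichever comes first.

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Flink): postpones requests for new workers
 * until previously known workers have re-registered or a grace period expires.
 */
final class RecoveryGate {
    private final Set<String> pendingWorkers;
    private final LongSupplier nanoClock;
    private final long deadlineNanos;

    /**
     * @param recoveredWorkerIds workers recovered from the previous attempt
     * @param gracePeriod how long to wait for them; zero disables the gate
     * @param nanoClock monotonic clock, e.g. System::nanoTime (injected for tests)
     */
    RecoveryGate(Set<String> recoveredWorkerIds, Duration gracePeriod, LongSupplier nanoClock) {
        this.pendingWorkers = new HashSet<>(recoveredWorkerIds);
        this.nanoClock = nanoClock;
        this.deadlineNanos = nanoClock.getAsLong() + gracePeriod.toNanos();
    }

    /** Call when a worker from the previous attempt registers again. */
    void onWorkerRegistered(String workerId) {
        pendingWorkers.remove(workerId);
    }

    /** True once all recovered workers are back or the grace period is over. */
    boolean mayRequestNewWorkers() {
        // Subtraction keeps the comparison valid even if the nano clock wraps.
        boolean graceExpired = nanoClock.getAsLong() - deadlineNanos >= 0;
        return pendingWorkers.isEmpty() || graceExpired;
    }
}
```

With a grace period of zero the gate opens immediately, which corresponds to 
disabling the feature by default; a value around 1s would match the 
registration times observed above.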

> Postpone resourceManager serving after the recovery phase has finished
> ----------------------------------------------------------------------
>
>                 Key: FLINK-24713
>                 URL: https://issues.apache.org/jira/browse/FLINK-24713
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0
>            Reporter: Aitozi
>            Priority: Major
>
> When the ResourceManager starts, the JobManager connects to it, which means 
> the ResourceManager begins to serve the resource requests from the 
> SlotManager.
> If the ResourceManager fails over, it tries to recover the pods / containers 
> from the previous attempt, but new resource requirements may arrive before 
> the old TaskManagers have registered with the SlotManager.
> In this case, a JobManager failover may double the number of required 
> TaskManagers. We may need a mechanism to postpone ResourceManager serving 
> until the recovery phase has finished.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
