[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045565#comment-17045565 ]
YufeiLiu commented on FLINK-16215:
----------------------------------

[~xintongsong] I understand your concern: we can't know how many slots will be recovered when the JM fails over. I came up with this because of the issue we discussed before, [FLINK-15959|https://issues.apache.org/jira/browse/FLINK-15959]: it is hard to know exactly how many slots are missing at startup.

> Start redundant TaskExecutor when JM failed
> -------------------------------------------
>
>                 Key: FLINK-16215
>                 URL: https://issues.apache.org/jira/browse/FLINK-16215
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: YufeiLiu
>            Priority: Major
>
> TaskExecutor will reconnect to the new ResourceManager leader when the JM fails,
> and the JobMaster will restart and reschedule the job. If the job's slot requests
> arrive earlier than the TM registrations, the RM will start new workers rather
> than reuse the existing TMs.
> It's hard to reproduce because TM registration usually comes first, and the
> timeout check will stop redundant TMs.
> But I think it would be better if we make {{recoverWokerNode}} part of the
> interface, and put recovered slots in {{pendingSlots}} to wait for TM
> reconnection.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)