[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043137#comment-17043137 ]
Xintong Song commented on FLINK-16215:
--------------------------------------

I share [~trohrmann]'s concern. On Yarn deployments, {{YarnResourceManager}} starts a {{TaskExecutor}} in two steps:
1. Request a container from Yarn.
2. Launch the {{TaskExecutor}} process inside the allocated container.

If a JM failover happens between the two steps, the container will be recovered but no {{TaskExecutor}} will be started inside it. I think it is a problem that such a container neither gets a {{TaskExecutor}} started in it nor gets released. This might be solved by FLINK-13554, which introduces a timeout for starting new {{TaskExecutor}}s; we could apply the same timeout to recovered containers as well (see the sketch after the quoted description below).

FYI, the Kubernetes deployment does not have this problem, because the pod/container is allocated and the {{TaskExecutor}} is started in a single step.

> Start redundant TaskExecutor when JM failed
> -------------------------------------------
>
>                 Key: FLINK-16215
>                 URL: https://issues.apache.org/jira/browse/FLINK-16215
>             Project: Flink
>          Issue Type: Bug
>      Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: YufeiLiu
>            Priority: Major
>
> A TaskExecutor will reconnect to the new ResourceManager leader when the JM fails,
> and the JobMaster will restart and reschedule the job. If the job's slot requests arrive
> earlier than the TM registrations, the RM will start new workers rather than reuse the
> existing TMs.
> It's hard to reproduce because TM registration usually comes first, and the
> timeout check will stop redundant TMs.
> But I think it would be better if we made {{recoverWorkerNode}} an interface
> method and put recovered slots in {{pendingSlots}} to wait for TM
> reconnection.
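To make the timeout idea in the comment concrete, here is a minimal sketch (not actual Flink code; the class and method names are invented for illustration) of a watchdog that releases a recovered container if no {{TaskExecutor}} registers from it within a timeout, along the lines of applying the FLINK-13554 timeout to recovered containers as well:

{code:java}
// Hypothetical sketch: track containers recovered after a failover and release
// any container whose TaskExecutor does not register within a timeout.
// None of these names correspond to real Flink classes or APIs.
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RecoveredContainerWatchdog {

    private final Map<String, Long> pendingRecoveredContainers = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Duration registrationTimeout;

    public RecoveredContainerWatchdog(Duration registrationTimeout) {
        this.registrationTimeout = registrationTimeout;
    }

    /** Called when a container is recovered after a failover, before any TaskExecutor registers. */
    public void onContainerRecovered(String containerId) {
        pendingRecoveredContainers.put(containerId, System.currentTimeMillis());
        scheduler.schedule(
                () -> checkRegistration(containerId),
                registrationTimeout.toMillis(),
                TimeUnit.MILLISECONDS);
    }

    /** Called when a TaskExecutor running in the given container registers with the ResourceManager. */
    public void onTaskExecutorRegistered(String containerId) {
        pendingRecoveredContainers.remove(containerId);
    }

    private void checkRegistration(String containerId) {
        // If the container is still pending after the timeout, no TaskExecutor showed up: release it.
        if (pendingRecoveredContainers.remove(containerId) != null) {
            releaseContainer(containerId);
        }
    }

    private void releaseContainer(String containerId) {
        // In a real ResourceManager this would go through the Yarn client to give the container back.
        System.out.println("Releasing recovered container without TaskExecutor: " + containerId);
    }
}
{code}

In the real {{YarnResourceManager}} the release would go through the Yarn client and the registration callback would be driven by the TaskExecutor registering at the ResourceManager; the sketch only shows the bookkeeping and timeout logic.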