[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043134#comment-17043134 ]

Yang Wang commented on FLINK-16215:
-----------------------------------

I think even if we make {{recoverWokerNode}} an interface and do the recovery 
before the slot request comes, we still could not completely avoid this problem, 
since there is no guarantee that we could get all of the previous containers from 
the recovery process. Some other containers may also be returned via the 
subsequent heartbeats.

Maybe the {{JobMaster}} should be aware of the failover and could recover the 
running slots from the {{TaskManager}}. If that fails with a timeout, it could then 
allocate a new slot from the {{ResourceManager}}. This is just a rough thought; 
please correct me if I am wrong.
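The idea above (try to recover slots from the previously known TaskManager first, and fall back to requesting a fresh slot from the ResourceManager on timeout) can be sketched roughly as below. This is a hypothetical illustration of the pattern, not Flink code; the class and method names ({{SlotRecoverySketch}}, {{acquireSlot}}, {{requestNewSlotFromResourceManager}}) are invented for the example:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: recover a slot from a reconnecting TaskManager,
// falling back to the ResourceManager if the TM never comes back in time.
public class SlotRecoverySketch {

    // tmRecovery completes when the old TaskManager reconnects and
    // re-offers its slot; if it does not complete within timeoutMillis,
    // we give up on reuse and ask the ResourceManager for a new slot.
    static String acquireSlot(CompletableFuture<String> tmRecovery,
                              long timeoutMillis) {
        try {
            return tmRecovery.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            // Timeout (or recovery failure): fall back to a fresh allocation.
            return requestNewSlotFromResourceManager();
        }
    }

    // Stand-in for a real slot request to the ResourceManager.
    static String requestNewSlotFromResourceManager() {
        return "new-slot-from-RM";
    }
}
```

With this shape, a TM that reconnects before the timeout has its slot reused, and only a TM that never returns triggers a new worker, which is the behavior the comment is asking for.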

 

> Start redundant TaskExecutor when JM failed
> -------------------------------------------
>
>                 Key: FLINK-16215
>                 URL: https://issues.apache.org/jira/browse/FLINK-16215
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: YufeiLiu
>            Priority: Major
>
> TaskExecutors will reconnect to the new ResourceManager leader when the JM fails, 
> and the JobMaster will restart and reschedule the job. If the job's slot requests 
> arrive earlier than the TM registrations, the RM will start new workers rather 
> than reuse the existing TMs.
> It's hard to reproduce because the TM registration usually comes first, and the 
> timeout check will stop the redundant TMs. 
> But I think it would be better if we make {{recoverWokerNode}} an 
> interface, and put the recovered slots in {{pendingSlots}} to wait for TM 
> reconnection.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)