[ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046417#comment-17046417 ]

Xintong Song commented on FLINK-16215:
--------------------------------------

[~liuyufei],

Is it possible that we first assume all the recovered containers have a TM 
process started? That means we might request fewer new workers than needed, and 
we can always request more once we find that a recovered container does not 
have a TM process started.

E.g., the min slots is 4 and num-slots-per-tm is 1, so we need at least 4 TMs. 
Say we recovered 2 containers, one with a TM process started inside and one 
without (but we don't know that yet). In that case, we can assume the TM will 
register for both recovered containers, so we request 2 more containers. At the 
same time, we request the container status for the recovered containers. When 
we receive the status of the container in state NEW, we release it and request 
one more container to meet the min number of workers.
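
To make the idea concrete, here is a minimal Java sketch of that bookkeeping. 
This is an illustration under assumptions, not the actual YarnResourceManager 
code: the class name and the helpers requestNewContainer() / releaseContainer() 
are hypothetical stand-ins for the real container request/release calls.

{code:java}
import java.util.HashSet;
import java.util.Set;

public class RecoveredContainerBookkeeping {

    private final int minRequiredWorkers;
    private final Set<String> pendingRecoveredContainers = new HashSet<>();

    public RecoveredContainerBookkeeping(int minRequiredWorkers) {
        this.minRequiredWorkers = minRequiredWorkers;
    }

    /** On recovery, optimistically count every recovered container as a future TM. */
    public void onContainersRecovered(Set<String> recoveredContainerIds) {
        pendingRecoveredContainers.addAll(recoveredContainerIds);
        int assumedWorkers = pendingRecoveredContainers.size();
        int missing = Math.max(0, minRequiredWorkers - assumedWorkers);
        for (int i = 0; i < missing; i++) {
            requestNewContainer();
        }
        // Also ask YARN for the status of each recovered container here.
    }

    /** Called when a recovered container's status arrives; state NEW means no TM process. */
    public void onContainerStatusReceived(String containerId, boolean isStateNew) {
        if (isStateNew && pendingRecoveredContainers.remove(containerId)) {
            releaseContainer(containerId);
            // The optimistic assumption was wrong for this container; top up by one.
            requestNewContainer();
        }
    }

    private void requestNewContainer() {
        // Hypothetical: issue a new container request to YARN.
    }

    private void releaseContainer(String containerId) {
        // Hypothetical: release the recovered container back to YARN.
    }
}
{code}

With the numbers from the example above (min 4 workers, 2 recovered 
containers), onContainersRecovered() requests 2 new containers, and the later 
NEW status releases the empty recovered container and requests a third.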

I would not consider the overhead of waiting for all the recovered containers' 
statuses to be trivial, especially in large production scenarios where you may 
have thousands of containers. On the other hand, recovering a container without 
a TM process started should be a rare case (as you said, hard to reproduce), so 
I think we can afford to start a new TM for it a bit later (i.e., after the 
status is received).

> Start redundant TaskExecutor when JM failed
> -------------------------------------------
>
>                 Key: FLINK-16215
>                 URL: https://issues.apache.org/jira/browse/FLINK-16215
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: YufeiLiu
>            Priority: Major
>
> The TaskExecutor will reconnect to the new ResourceManager leader when the JM 
> fails, and the JobMaster will restart and reschedule the job. If the job's 
> slot requests arrive earlier than the TM registrations, the RM will start new 
> workers rather than reuse the existing TMs.
> It's hard to reproduce because the TM registration usually comes first, and 
> the timeout check will stop the redundant TMs. 
> But I think it would be better to make {{recoverWorkerNode}} an interface and 
> put the recovered slots in {{pendingSlots}} to wait for the TM reconnection.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
