[ 
https://issues.apache.org/jira/browse/FLINK-18625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166829#comment-17166829
 ] 

Xintong Song commented on FLINK-18625:
--------------------------------------

[~trohrmann], regarding your questions.

bq. How would this feature work if the job requests heterogeneous slots which 
might result into differently sized TMs? I guess we will allocate default sized 
TMs. But what if this will prevent us from allocating fewer larger sized TMs 
which are required for fulfilling the heterogeneous slot requests?
I see your point. One optimization could be to release the redundant task 
managers when there are heterogeneous pending worker requests. The problem is 
that a redundant task manager may not be releasable if any of its slots are 
allocated (e.g., when slots are evenly spread out), and even if it is 
releasable, obtaining the new task manager would take more time. I guess that's 
the price we need to pay if this feature is enabled. WDYT?

bq. How does this feature relate to FLINK-16605 and FLINK-15959? I believe that 
the lower and upper bounds should also limit the number of redundant slots, 
right?
According to [~Jiangang]'s PR, the upper bound also limits the number of 
redundant slots. I believe it should be the same for the lower bound. We should 
make sure of that when working on FLINK-15959. cc [~karmagyz]
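A minimal sketch of the interaction with the upper bound discussed above, assuming the bound is expressed as a maximum worker count. All names here (allowedRedundantWorkers, maxWorkers) are illustrative, not SlotManagerImpl's actual API:

```java
// Hypothetical sketch: redundant worker requests must be clamped so that
// the total worker count never exceeds the configured upper bound.
final class RedundantWorkerClamp {

    /** Returns how many redundant workers may still be requested. */
    static int allowedRedundantWorkers(
            int redundantTaskmanagerNum, int currentWorkers, int maxWorkers) {
        // Headroom left under the upper bound (never negative).
        int headroom = Math.max(0, maxWorkers - currentWorkers);
        // Request at most the configured redundancy, at most the headroom.
        return Math.min(redundantTaskmanagerNum, headroom);
    }
}
```

The lower bound from FLINK-15959 would presumably act symmetrically, preventing the timeout path from releasing idle taskmanagers below the configured minimum.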

> Maintain redundant taskmanagers to speed up failover
> ----------------------------------------------------
>
>                 Key: FLINK-18625
>                 URL: https://issues.apache.org/jira/browse/FLINK-18625
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: Liu
>            Assignee: Liu
>            Priority: Major
>              Labels: pull-request-available
>
> When a Flink job fails because its taskmanagers were killed, it requests new 
> containers on restart. Requesting new containers can be very slow, sometimes 
> taking dozens of seconds or more. The reasons vary: for example, YARN and 
> HDFS may be slow, or machine performance may be poor. In some production 
> scenarios the SLA is strict and failover should complete within seconds.
>  
> To speed up the recovery process, we can maintain redundant slots in advance. 
> When the job restarts, it can use the redundant slots immediately instead of 
> requesting new taskmanagers.
>  
> The implementation can be done in SlotManagerImpl. Below is a brief description:
>  # In the constructor, initialize redundantTaskmanagerNum from the configuration.
>  # In method start(), allocate the redundant taskmanagers.
>  # In method start(), change taskManagerTimeoutCheck() to 
> checkValidTaskManagers().
>  # In method checkValidTaskManagers(), manage both redundant and timed-out 
> taskmanagers. The number of idle taskmanagers must not be less than 
> redundantTaskmanagerNum.
>  * If fewer, allocate from the resourceManager until equal.
>  * If more, release timed-out taskmanagers but keep at least 
> redundantTaskmanagerNum idle taskmanagers.
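The steps above can be sketched as follows. The names redundantTaskmanagerNum and checkValidTaskManagers come from the issue; the fields and helper structure are illustrative assumptions, not SlotManagerImpl's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of step 4: keep idle taskmanagers >= redundantTaskmanagerNum,
// requesting extra workers when below and releasing only the surplus
// timed-out taskmanagers when above. (Illustrative, not Flink's real API.)
class RedundancyCheckSketch {
    final int redundantTaskmanagerNum;     // from configuration (step 1)
    int idleTaskmanagers;                  // TMs with no allocated slots
    int requestedWorkers;                  // workers requested from the RM
    final List<String> released = new ArrayList<>();

    RedundancyCheckSketch(int redundantTaskmanagerNum, int idleTaskmanagers) {
        this.redundantTaskmanagerNum = redundantTaskmanagerNum;
        this.idleTaskmanagers = idleTaskmanagers;
    }

    void checkValidTaskManagers(List<String> timedOutIdleTaskManagers) {
        if (idleTaskmanagers < redundantTaskmanagerNum) {
            // Fewer idle TMs than required: request the difference.
            requestedWorkers += redundantTaskmanagerNum - idleTaskmanagers;
        } else {
            // More idle TMs than required: release timed-out TMs, but
            // always keep at least redundantTaskmanagerNum idle ones.
            int releasable = idleTaskmanagers - redundantTaskmanagerNum;
            for (String tm : timedOutIdleTaskManagers) {
                if (releasable <= 0) {
                    break;
                }
                released.add(tm);
                releasable--;
                idleTaskmanagers--;
            }
        }
    }
}
```

In this reading, the redundancy floor takes precedence over the idle timeout: a timed-out taskmanager survives if releasing it would drop the idle count below the configured redundancy.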



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
