[jira] [Commented] (FLINK-18625) Maintain redundant taskmanagers to speed up failover

Xintong Song (Jira) Sun, 19 Jul 2020 20:20:02 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-18625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160862#comment-17160862
 ]


Xintong Song commented on FLINK-18625:
--------------------------------------

Thanks for the clarification, [~Jiangang].
Configuring an absolute number of task managers makes sense to me.

Would you like to prepare a PR for this ticket?

> Maintain redundant taskmanagers to speed up failover
> ----------------------------------------------------
>
>                 Key: FLINK-18625
>                 URL: https://issues.apache.org/jira/browse/FLINK-18625
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: Liu
>            Priority: Major
>
> When a Flink job fails because taskmanagers were killed, it requests new 
> containers on restart. Requesting new containers can be very slow, 
> sometimes taking dozens of seconds or more. The causes vary: for example, 
> YARN and HDFS may be slow, or machine performance may be poor. In some 
> production scenarios, the SLA is strict and failover should finish in seconds.
>  
> To speed up the recovery process, we can maintain redundant taskmanagers in 
> advance. When the job restarts, it can use the redundant taskmanagers 
> immediately instead of requesting new ones.
>  
> The implementation can be done in SlotManagerImpl. Below is a brief description:
>  # In the constructor, initialize redundantTaskmanagerNum from the config.
>  # In method start(), allocate the redundant taskmanagers.
>  # In method start(), change taskManagerTimeoutCheck() to 
> redundantTaskmanagerCheck().
>  # In method redundantTaskmanagerCheck(), manage both redundant taskmanagers 
> and timed-out taskmanagers. The number of idle taskmanagers must not be less 
> than redundantTaskmanagerNum.
>  * If it is less, allocate from the resourceManager until it is equal.
>  * If it is more, release timed-out taskmanagers but keep at least 
> redundantTaskmanagerNum idle taskmanagers.
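
The check described in the last step could be sketched roughly as below. This is only an illustration of the counting logic, not Flink's SlotManagerImpl code: the class RedundantTaskManagerSketch, the Decision type, and the idleCount / timedOutIdleCount parameters are hypothetical names introduced here. Only redundantTaskmanagerNum and redundantTaskmanagerCheck() come from the description.

```java
// Hypothetical sketch of the proposed check; it does not use any Flink API.
final class RedundantTaskManagerSketch {

    /** Outcome of one periodic check (hypothetical type). */
    static final class Decision {
        final int toAllocate; // taskmanagers to request from the resource manager
        final int toRelease;  // timed-out idle taskmanagers to release

        Decision(int toAllocate, int toRelease) {
            this.toAllocate = toAllocate;
            this.toRelease = toRelease;
        }
    }

    /**
     * @param idleCount               idle taskmanagers right now
     * @param timedOutIdleCount       idle taskmanagers whose idle timeout has expired
     * @param redundantTaskmanagerNum configured number of taskmanagers to keep idle
     */
    static Decision redundantTaskmanagerCheck(
            int idleCount, int timedOutIdleCount, int redundantTaskmanagerNum) {
        if (idleCount < 0 || timedOutIdleCount < 0 || redundantTaskmanagerNum < 0
                || timedOutIdleCount > idleCount) {
            throw new IllegalArgumentException("inconsistent taskmanager counts");
        }
        if (idleCount < redundantTaskmanagerNum) {
            // Too few idle taskmanagers: allocate until the target is reached.
            return new Decision(redundantTaskmanagerNum - idleCount, 0);
        }
        // Enough idle taskmanagers: release timed-out ones, but never go
        // below redundantTaskmanagerNum idle taskmanagers.
        int surplus = idleCount - redundantTaskmanagerNum;
        return new Decision(0, Math.min(surplus, timedOutIdleCount));
    }

    public static void main(String[] args) {
        Decision grow = redundantTaskmanagerCheck(1, 0, 3);
        System.out.println(grow.toAllocate + " " + grow.toRelease);     // prints "2 0"
        Decision shrink = redundantTaskmanagerCheck(5, 4, 3);
        System.out.println(shrink.toAllocate + " " + shrink.toRelease); // prints "0 2"
    }
}
```

In the second call, four idle taskmanagers have timed out but only two are released, so three stay idle as the redundant reserve.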



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
