[
https://issues.apache.org/jira/browse/FLINK-18625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163585#comment-17163585
]
Till Rohrmann commented on FLINK-18625:
---------------------------------------
Thanks for proposing this feature [~Jiangang]. I like the idea and can see the
benefit for our users.
I have a couple of questions:
* How would this feature work if the job requests heterogeneous slots, which
might result in differently sized TMs? I guess we would allocate default-sized
TMs. But what if this prevents us from allocating the fewer, larger TMs
that are required to fulfill the heterogeneous slot requests?
* How does this feature relate to FLINK-16605 and FLINK-15959? I believe that
the lower and upper bounds should also limit the number of redundant slots,
right?
> Maintain redundant taskmanagers to speed up failover
> ----------------------------------------------------
>
> Key: FLINK-18625
> URL: https://issues.apache.org/jira/browse/FLINK-18625
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Coordination
> Reporter: Liu
> Assignee: Liu
> Priority: Major
> Labels: pull-request-available
>
> When a Flink job fails because of killed taskmanagers, it requests new
> containers when restarting. Requesting new containers can be very slow,
> sometimes taking dozens of seconds or even more. The reasons vary; for
> example, YARN and HDFS are slow, or machine performance is poor. In some
> production scenarios, the SLA is strict and failover should finish in seconds.
>
> To speed up the recovery process, we can maintain redundant slots in advance.
> When the job restarts, it can use the redundant slots immediately instead of
> requesting new taskmanagers.
>
> The implementation can be done in SlotManagerImpl. Below is a brief description:
> # In the constructor, initialize redundantTaskmanagerNum from the config.
> # In method start(), allocate the redundant taskmanagers.
> # In method start(), change taskManagerTimeoutCheck() to
> checkValidTaskManagers().
> # In method checkValidTaskManagers(), manage redundant taskmanagers and
> timed-out taskmanagers. The number of idle taskmanagers must not be less than
> redundantTaskmanagerNum.
> * If less, allocate from resourceManager until equal.
> * If more, release timed-out taskmanagers but keep at least
> redundantTaskmanagerNum idle taskmanagers.
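
The check described in the list above can be sketched roughly as follows. This
is only an illustration of the counting rule, not the actual SlotManagerImpl
code: the class name, both method names, and the parameters idleCount and
timedOutIdleCount are hypothetical, and the sketch assumes that every timed-out
taskmanager is also idle.

```java
/**
 * Hypothetical sketch of the rule behind checkValidTaskManagers().
 * Nothing here is taken from Flink's SlotManagerImpl; all names are assumptions.
 */
public class RedundantTaskManagerSketch {

    /** Number of taskmanagers to request so that idle ones reach the redundant target. */
    static int taskManagersToAllocate(int idleCount, int redundantTaskManagerNum) {
        // "If less, allocate from resourceManager until equal."
        return Math.max(0, redundantTaskManagerNum - idleCount);
    }

    /**
     * Number of timed-out taskmanagers that may be released while still keeping
     * at least redundantTaskManagerNum idle taskmanagers.
     * Assumes timedOutIdleCount <= idleCount.
     */
    static int taskManagersToRelease(
            int idleCount, int timedOutIdleCount, int redundantTaskManagerNum) {
        // "If more, release timed-out taskmanagers but keep at least
        // redundantTaskmanagerNum idle taskmanagers."
        int surplus = Math.max(0, idleCount - redundantTaskManagerNum);
        return Math.min(surplus, timedOutIdleCount);
    }

    public static void main(String[] args) {
        int redundant = 3; // would come from the config in the proposal
        System.out.println("allocate: " + taskManagersToAllocate(1, redundant));
        System.out.println("release:  " + taskManagersToRelease(5, 4, redundant));
    }
}
```

Keeping the two decisions as pure functions of the counts makes the invariant
(idle taskmanagers never drop below the redundant target because of a timeout
release) easy to check in isolation from the resource manager.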