[
https://issues.apache.org/jira/browse/FLINK-18625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160856#comment-17160856
]
Liu edited comment on FLINK-18625 at 7/20/20, 2:46 AM:
-------------------------------------------------------
Thanks for [~xintongsong]'s reply. Here are my thoughts:
* A ratio config may be another good choice, since it scales with the total
resources. In our case, however, jobs often fail because a taskmanager is
killed for OOM, and the number of killed taskmanagers is usually small, for
example one or a few. A fixed number of redundant taskmanagers is then enough,
and the backup resources stay small. We would also keep the redundant
taskmanagers at that number across multiple failovers. Maybe both a ratio
config and a number config should be supported.
* I agree that redundancy is better expressed in slots than in taskmanagers,
since that is more fine-grained. In this case, users may need to configure
more slots, because idle slots can be lost when a taskmanager is killed.
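One way to support both a number config and a ratio config is to take the larger of the two targets. The sketch below is only an illustration of that idea: the class name, the method name, and the choice to express the ratio as an integer percentage are assumptions, not existing Flink options or code.

```java
/** Hypothetical helper; not part of Flink. */
public class RedundancyTarget {

    /**
     * Returns how many redundant taskmanagers to keep.
     *
     * @param fixedNum          assumed "number config": an absolute count
     * @param ratioPercent      assumed "ratio config", as a percentage of running taskmanagers
     * @param totalTaskManagers taskmanagers currently needed by the job
     */
    public static int target(int fixedNum, int ratioPercent, int totalTaskManagers) {
        if (fixedNum < 0 || ratioPercent < 0 || totalTaskManagers < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        // Round the ratio-based target up using integer arithmetic,
        // which avoids floating-point rounding surprises.
        int byRatio = (int) (((long) totalTaskManagers * ratioPercent + 99) / 100);
        // Assumption: when both are set, the larger target wins.
        return Math.max(fixedNum, byRatio);
    }
}
```

With this rule, a small job is covered by the fixed number, while a large job scales with the ratio.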
was (Author: jiangang):
Thanks for [~xintongsong]'s reply. Here are my thoughts:
* A ratio config may be another good choice, since it scales with the total
resources. In our case, however, jobs often fail because a taskmanager is
killed for OOM, and the number of killed taskmanagers is usually small, for
example one or a few. A fixed number of redundant taskmanagers is then enough,
and the backup resources stay small. We would also keep the redundant
taskmanagers at that number across multiple failovers. Maybe both a ratio
config and a number config should be supported.
* I agree that redundancy is better expressed in slots than in taskmanagers,
since that is more fine-grained. In this case, we may need to configure more
slots, because idle slots can be lost when a taskmanager is killed.
> Maintain redundant taskmanagers to speed up failover
> ----------------------------------------------------
>
> Key: FLINK-18625
> URL: https://issues.apache.org/jira/browse/FLINK-18625
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Coordination
> Reporter: Liu
> Priority: Major
>
> When a Flink job fails because taskmanagers were killed, it requests new
> containers when restarting. Requesting new containers can be very slow,
> sometimes taking dozens of seconds or even more. The reasons vary: for
> example, YARN and HDFS are slow, or machine performance is poor. In some
> production scenarios, the SLA is strict and failover should finish in seconds.
>
> To speed up the recovery process, we can maintain redundant taskmanagers in
> advance. When the job restarts, it can use the redundant taskmanagers at once
> instead of requesting new taskmanagers.
>
> The implementation can be done in SlotManagerImpl. Below is a brief description:
> # In the constructor, initialize redundantTaskmanagerNum from the config.
> # In method start(), allocate the redundant taskmanagers.
> # In method start(), change taskManagerTimeoutCheck() to
> redundantTaskmanagerCheck().
> # In method redundantTaskmanagerCheck(), manage redundant taskmanagers and
> timed-out taskmanagers. The number of idle taskmanagers must not be less than
> redundantTaskmanagerNum.
> * If less, allocate from the resourceManager until equal.
> * If more, release timed-out taskmanagers but keep at least
> redundantTaskmanagerNum idle taskmanagers.
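
The check described in the last step can be sketched as a pure decision function. This is a minimal illustration under stated assumptions, not the actual SlotManagerImpl code: the class RedundantTaskManagerCheck, its nested types, and the check() signature are hypothetical; only redundantTaskmanagerNum is a name taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposed check; not Flink's SlotManagerImpl. */
public class RedundantTaskManagerCheck {

    /** Minimal stand-in for an idle taskmanager (assumed shape). */
    public static final class IdleTaskManager {
        public final String id;
        public final long idleSinceMillis;

        public IdleTaskManager(String id, long idleSinceMillis) {
            this.id = id;
            this.idleSinceMillis = idleSinceMillis;
        }
    }

    /** Result of one check: how many to request and which ones to release. */
    public static final class Decision {
        public final int toAllocate;
        public final List<String> toRelease;

        Decision(int toAllocate, List<String> toRelease) {
            this.toAllocate = toAllocate;
            this.toRelease = toRelease;
        }
    }

    public static Decision check(
            List<IdleTaskManager> idle,
            int redundantTaskmanagerNum,
            long timeoutMillis,
            long nowMillis) {
        int idleCount = idle.size();
        if (idleCount < redundantTaskmanagerNum) {
            // Too few idle taskmanagers: request the missing ones, release nothing.
            return new Decision(redundantTaskmanagerNum - idleCount, new ArrayList<>());
        }
        // At most this many can go while still keeping redundantTaskmanagerNum idle.
        int releasable = idleCount - redundantTaskmanagerNum;
        List<String> toRelease = new ArrayList<>();
        for (IdleTaskManager tm : idle) {
            if (toRelease.size() >= releasable) {
                break;
            }
            // Only taskmanagers that have been idle past the timeout are candidates.
            if (nowMillis - tm.idleSinceMillis >= timeoutMillis) {
                toRelease.add(tm.id);
            }
        }
        return new Decision(0, toRelease);
    }
}
```

Keeping the decision separate from the actual allocate and release calls makes the rule easy to test without a running cluster; a real implementation would still have to account for allocation requests that are already pending.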
--
This message was sent by Atlassian Jira
(v8.3.4#803005)