[ https://issues.apache.org/jira/browse/FLINK-18625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160862#comment-17160862 ]
Xintong Song commented on FLINK-18625:
--------------------------------------

Thanks for the clarification, [~Jiangang].

Configuring an absolute number of task managers makes sense to me. Would you like to prepare a PR for this ticket?

> Maintain redundant taskmanagers to speed up failover
> ----------------------------------------------------
>
>                 Key: FLINK-18625
>                 URL: https://issues.apache.org/jira/browse/FLINK-18625
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: Liu
>            Priority: Major
>
> When a Flink job fails because taskmanagers were killed, it requests new
> containers when restarting. Requesting new containers can be very slow;
> sometimes it takes dozens of seconds or even more. The reasons vary: for
> example, YARN and HDFS are slow, or machine performance is poor. In some
> production scenarios, the SLA is strict and failover should complete within
> seconds.
>
> To speed up the recovery process, we can maintain redundant taskmanagers in
> advance. When the job restarts, it can use the redundant taskmanagers at once
> instead of requesting new taskmanagers.
>
> The implementation can be done in SlotManagerImpl. Below is a brief description:
> # In the constructor, initialize redundantTaskmanagerNum from the config.
> # In method start(), allocate the redundant taskmanagers.
> # In method start(), change taskManagerTimeoutCheck() to
> redundantTaskmanagerCheck().
> # In method redundantTaskmanagerCheck(), manage both redundant taskmanagers and
> timed-out taskmanagers. The number of idle taskmanagers must not be less than
> redundantTaskmanagerNum.
> * If it is less, allocate from resourceManager until it is equal.
> * If it is more, release timed-out taskmanagers but keep at least
> redundantTaskmanagerNum idle taskmanagers.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
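
As a rough illustration of the redundantTaskmanagerCheck() step described in the issue, the decision logic could look something like the sketch below. This is not the actual SlotManagerImpl code: the class RedundantTaskManagerCheck, the Decision type, and the check() signature are all hypothetical names chosen for this example, and the sketch assumes idle taskmanagers are tracked as a map from ID to the timestamp at which they became idle.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposed check; not part of Flink's API. */
final class RedundantTaskManagerCheck {

    /** Outcome of one check: how many taskmanagers to request, and which idle ones to release. */
    static final class Decision {
        final int toAllocate;
        final List<String> toRelease;

        Decision(int toAllocate, List<String> toRelease) {
            this.toAllocate = toAllocate;
            this.toRelease = toRelease;
        }
    }

    /**
     * @param idleSinceMillis idle taskmanager IDs mapped to the time they became idle (assumed bookkeeping)
     * @param redundantTaskmanagerNum minimum number of idle taskmanagers to keep
     * @param timeoutMillis idle time after which a taskmanager counts as timed out
     * @param nowMillis current time
     */
    static Decision check(
            Map<String, Long> idleSinceMillis,
            int redundantTaskmanagerNum,
            long timeoutMillis,
            long nowMillis) {

        int idle = idleSinceMillis.size();

        // Fewer idle taskmanagers than required: request the difference.
        if (idle < redundantTaskmanagerNum) {
            return new Decision(redundantTaskmanagerNum - idle, Collections.<String>emptyList());
        }

        // Otherwise release timed-out taskmanagers, but never drop below the redundant count.
        int maxReleasable = idle - redundantTaskmanagerNum;
        List<String> toRelease = new ArrayList<>();
        for (Map.Entry<String, Long> entry : idleSinceMillis.entrySet()) {
            if (toRelease.size() >= maxReleasable) {
                break;
            }
            if (nowMillis - entry.getValue() >= timeoutMillis) {
                toRelease.add(entry.getKey());
            }
        }
        return new Decision(0, toRelease);
    }
}
```

Keeping the decision as a pure function of the idle set and the clock makes both branches of the check easy to test in isolation; the caller would then forward toAllocate and toRelease to whatever allocation and release calls the slot manager actually uses.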