[
https://issues.apache.org/jira/browse/FLINK-18625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159917#comment-17159917
]
Xintong Song commented on FLINK-18625:
--------------------------------------
Thanks for bring this up, [~Jiangang].
I like the idea to maintain extra back-up containers for speeding up the job
recovery. That should be helpful in timeliness sensitive scenarios.
Regarding the proposed implementation plan, I have a few suggestions.
- Instead of letting users to configure the absolute number of redundant task
managers, I would suggest to expose the configuration option as a ratio. E.g.,
if the ratio is configured to 0.2 and the job execution needs 10 TMs, then 12
TMs will be launched. The benefit is that jobs at different scales may share
the common set of configurations.
- I'm not sure about "the idle taskmanager number must be no less than
redundantTaskmanagerNum". I think what we really need is to have enough
redundant slots where the tasks can be deployed into. It could happen that we
already have enough redundant TMs but none of the TMs is idle (some TMs may
contain both allocated and free slots). In such cases, there should be enough
redundant slots and we should not keep more TMs.
WDYT?
> Maintain redundant taskmanagers to speed up failover
> ----------------------------------------------------
>
> Key: FLINK-18625
> URL: https://issues.apache.org/jira/browse/FLINK-18625
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Coordination
> Reporter: Liu
> Priority: Major
>
> When flink job fails because of killed taskmanagers, it will request new
> containers when restarting. Requesting new containers can be very slow,
> sometimes it takes dozens of seconds even more. The reasons can be different,
> for example, yarn and hdfs are slow, machine performance is poor. In some
> product scenario, SLA is high and failover should be in seconds.
>
> To speed up the recovery process, we can maintain redundant taskmanagers in
> advance. When job restarts, it can use the redundant taskmanagers at once
> instead of requesting new taskmanagers.
>
> The implemention can be done in SlotManagerImpl. Below is a brief description:
> # In construct method, init redundantTaskmanagerNum from config.
> # In method start(), allocate redundant taskmanagers.
> # In method start(), Change taskManagerTimeoutCheck() to
> redundantTaskmanagerCheck().
> # In method redundantTaskmanagerCheck(), manage redundant taskmanagers and
> timeout taskmanagers. The idle taskmanager number must be not less than
> redundantTaskmanagerNum.
> * If less, allocate from resourceManager until equal.
> * If more, release timeout taskmanagers but keep at least
> redundantTaskmanagerNum idle taskmanagers.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)