[
https://issues.apache.org/jira/browse/FLINK-25277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Niklas Semmler updated FLINK-25277:
-----------------------------------
Description:
We need to introduce shutdown signalling between TaskManager and JobManager for
fast & graceful shutdown in reactive scheduler mode.
In Flink 1.14 and earlier versions, the JobManager tracks the availability of a
TaskManager using a heartbeat. This heartbeat is by default configured with an
interval of 10 seconds and a timeout of 50 seconds [1]. Hence, the shutdown of
a TaskManager is recognized only after about 50-60 seconds. This works fine for
the static scheduling mode, where a TaskManager only disappears as part of a
cluster shutdown or a job failure. However, in the reactive scheduler mode
(FLINK-10407), TaskManagers are regularly added and removed from a running job.
Here, the heartbeat mechanism incurs additional delays.
To remove these delays, we add an explicit shutdown signal from the TaskManager
to the JobManager.
[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout
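As a rough illustration of the difference between the two detection paths, here is a minimal sketch. It is not Flink code: the class TaskManagerTracker, its methods, and the TaskManager ids are hypothetical names invented for this example, and time is passed in as plain millisecond values so the behaviour is deterministic. Only the 50 second timeout is taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;

public class ShutdownSignalSketch {

    // Default heartbeat timeout quoted in the description (50 seconds).
    static final long HEARTBEAT_TIMEOUT_MS = 50_000L;

    /** Hypothetical JobManager-side view of which TaskManagers are considered alive. */
    static class TaskManagerTracker {
        // TaskManager id -> timestamp (ms) of the last heartbeat received from it.
        private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

        void onHeartbeat(String tmId, long nowMs) {
            lastHeartbeatMs.put(tmId, nowMs);
        }

        /** Explicit shutdown signal: forget the TaskManager immediately. */
        void onShutdownSignal(String tmId) {
            lastHeartbeatMs.remove(tmId);
        }

        /** Heartbeat path: forget only TaskManagers whose last heartbeat is older than the timeout. */
        void expireTimedOut(long nowMs) {
            lastHeartbeatMs.values().removeIf(last -> nowMs - last >= HEARTBEAT_TIMEOUT_MS);
        }

        boolean isKnown(String tmId) {
            return lastHeartbeatMs.containsKey(tmId);
        }
    }

    public static void main(String[] args) {
        TaskManagerTracker tracker = new TaskManagerTracker();
        tracker.onHeartbeat("tm-1", 0L);
        tracker.onHeartbeat("tm-2", 0L);

        // tm-1 announces its shutdown; tm-2 disappears silently.
        tracker.onShutdownSignal("tm-1");
        tracker.expireTimedOut(1_000L);
        System.out.println("tm-1 known after signal: " + tracker.isKnown("tm-1"));
        System.out.println("tm-2 known after 1 s:    " + tracker.isKnown("tm-2"));

        // tm-2 is only dropped once the heartbeat timeout has elapsed.
        tracker.expireTimedOut(50_000L);
        System.out.println("tm-2 known after 50 s:   " + tracker.isKnown("tm-2"));
    }
}
```

In this sketch the signalled TaskManager is dropped at once, while the silent one stays registered until the timeout expires, which is the delay the explicit signal is meant to remove.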
was:
We need to introduce shutdown signalling between TaskManager and JobManager for
fast & graceful shutdown in reactive scheduler mode.
In Flink 1.14 and earlier versions, the JobManager tracks the availability of a
TaskManager using a heartbeat. This heartbeat is by default configured with an
interval of 10 seconds and a timeout of 50 seconds [1]. Hence, the shutdown of
a TaskManager is recognized only after about 50-60 seconds. This works fine for
the static scheduling mode, where a TaskManager only disappears as part of a
cluster shutdown or a job failure. However, in the reactive scheduler mode
(FLINK-10407), TaskManagers are regularly added and removed from a running job.
Here, the heartbeat mechanism incurs additional delays.
To remove these delays, we add an explicit shutdown signal from the TaskManager
to the JobManager. Additionally, to avoid data loss in a running job, the
TaskManager will wait for a shutdown confirmation from the JobManager before
shutting down.
[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout
> Introduce explicit shutdown signalling between TaskManager and JobManager
> --------------------------------------------------------------------------
>
> Key: FLINK-25277
> URL: https://issues.apache.org/jira/browse/FLINK-25277
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Affects Versions: 1.13.0, 1.14.0
> Reporter: Niklas Semmler
> Assignee: Niklas Semmler
> Priority: Major
> Labels: reactive
> Fix For: 1.15.0
>
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> We need to introduce shutdown signalling between TaskManager and JobManager
> for fast & graceful shutdown in reactive scheduler mode.
> In Flink 1.14 and earlier versions, the JobManager tracks the availability of
> a TaskManager using a heartbeat. This heartbeat is by default configured with
> an interval of 10 seconds and a timeout of 50 seconds [1]. Hence, the
> shutdown of a TaskManager is recognized only after about 50-60 seconds. This
> works fine for the static scheduling mode, where a TaskManager only
> disappears as part of a cluster shutdown or a job failure. However, in the
> reactive scheduler mode (FLINK-10407), TaskManagers are regularly added and
> removed from a running job. Here, the heartbeat mechanism incurs additional
> delays.
> To remove these delays, we add an explicit shutdown signal from the
> TaskManager to the JobManager.
>
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout