Niklas Semmler created FLINK-25277:
--------------------------------------

             Summary: Introduce explicit shutdown signalling between 
TaskManager and JobManager 
                 Key: FLINK-25277
                 URL: https://issues.apache.org/jira/browse/FLINK-25277
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 1.14.0, 1.13.0
            Reporter: Niklas Semmler
             Fix For: 1.15.0


We need to introduce shutdown signalling between TaskManager and JobManager for 
fast & graceful shutdown in reactive scheduler mode.

In Flink 1.14 and earlier versions, the JobManager tracks the availability of a 
TaskManager using a heartbeat. This heartbeat is by default configured with an 
interval of 10 seconds and a timeout of 50 seconds [1]. Hence, the shutdown of 
a TaskManager is recognized only after about 50-60 seconds. This works fine for 
the static scheduling mode, where a TaskManager only disappears as part of a 
cluster shutdown or a job failure. However, in the reactive scheduler mode 
(FLINK-10407), TaskManagers are regularly added to and removed from a running 
job. Here, the heartbeat mechanism incurs additional delays.

To remove these delays, we add an explicit shutdown signal from the TaskManager 
to the JobManager. Additionally, to avoid data loss in a running job, the 
TaskManager will wait for a shutdown confirmation from the JobManager before 
shutting down.
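
As a rough illustration of the proposed handshake, here is a minimal sketch in 
plain Java. All class and method names (JobManagerStub, TaskManagerStub, 
notifyShutdown) are hypothetical and are not Flink's actual RPC interfaces. 
The sketch only shows the ordering: the TaskManager signals, the JobManager 
deregisters it right away and confirms, and only then does the TaskManager 
stop.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hypothetical names throughout; this is not Flink's real coordination code.
class ShutdownHandshakeSketch {

    /** Stand-in for the JobManager: tracks which TaskManagers are registered. */
    static class JobManagerStub {
        private final Set<String> registered = new HashSet<>();

        void register(String taskManagerId) {
            registered.add(taskManagerId);
        }

        boolean isRegistered(String taskManagerId) {
            return registered.contains(taskManagerId);
        }

        /**
         * Explicit shutdown signal. The TaskManager is deregistered immediately,
         * without waiting for a heartbeat timeout. The returned future is the
         * confirmation. A real implementation would presumably complete it only
         * once the affected tasks no longer depend on this TaskManager.
         */
        CompletableFuture<Void> notifyShutdown(String taskManagerId) {
            registered.remove(taskManagerId);
            return CompletableFuture.completedFuture(null);
        }
    }

    /** Stand-in for the TaskManager. */
    static class TaskManagerStub {
        private final String id;
        private final JobManagerStub jobManager;
        private boolean stopped = false;

        TaskManagerStub(String id, JobManagerStub jobManager) {
            this.id = id;
            this.jobManager = jobManager;
            jobManager.register(id);
        }

        /** Signals the JobManager, waits for the confirmation, then stops. */
        void shutDown() {
            // join() blocks until the confirmation arrives. A production
            // version would likely bound this wait with a timeout.
            jobManager.notifyShutdown(id).join();
            stopped = true;
        }

        boolean isStopped() {
            return stopped;
        }
    }

    public static void main(String[] args) {
        JobManagerStub jm = new JobManagerStub();
        TaskManagerStub tm = new TaskManagerStub("tm-1", jm);
        tm.shutDown();
        System.out.println("registered=" + jm.isRegistered("tm-1")
                + ", stopped=" + tm.isStopped());
    }
}
```

In this sketch the JobManager learns about the departure as soon as the signal 
arrives, and the TaskManager never stops before it has been confirmed, which 
is the property that is meant to avoid data loss.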

 

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
