[ 
https://issues.apache.org/jira/browse/FLINK-25277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474973#comment-17474973
 ] 

Niklas Semmler commented on FLINK-25277:
----------------------------------------

The [PR #18169 |https://github.com/apache/flink/pull/18169]introduces a minimal 
change to start the existing [TaskExecutor's 
onStop|https://github.com/apache/flink/blob/dbbf2a36111da1faea5c901e3b008cc788913bf8/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java#L444]
 method when a SIGTERM is received. This reduces the shutdown time from ~20 
seconds (waiting for timeout) to less than one second.

In an infrequently occurring scenario, the new shutdown hook introduces a race 
condition in the YARNResourceManagerDriver. When the TaskExecutor disconnects 
from the 
[ResourceManager|https://github.com/apache/flink/blob/dbbf2a36111da1faea5c901e3b008cc788913bf8/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L494],
 it [releases the TaskExecutor's YARN 
container|https://github.com/apache/flink/blob/dbbf2a36111da1faea5c901e3b008cc788913bf8/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManagerDriver.java#L287].
 After release attempt, it initates a 
[callback|https://github.com/apache/flink/blob/dbbf2a36111da1faea5c901e3b008cc788913bf8/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManagerDriver.java#L575].
 When ResourceManager terminates at the same time (due to a SIGTERM), then it 
also [terminates the 
YARNResourceManagerDriver|https://github.com/apache/flink/blob/dbbf2a36111da1faea5c901e3b008cc788913bf8/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManagerDriver.java#L196].
 If the release and termination process overlap, so that the termination starts 
before the container is fully released, YARN throws an exception.

To ensure that the container release process always completes before the 
termination process is started, I used a _java.util.concurrent.Phaser_ for 
synchronization. This only affects the rare cases where the two processes 
overlap.

The race condition occurred infrequently on Flink's Azure pipelines. I was 
unable to replicate it locally. However, after the addition of the 
synchronization mechanism, I did not see the race condition in about 100 
executions.

Note, that the issue is somewhat superficial: By default, YARN issues a SIGKILL 
250ms after the SIGTERM 
([here|[https://hadoop.apache.org/docs/r2.8.5/hadoop-yarn/hadoop-yarn-common/yarn-default.xml]])

> Introduce explicit shutdown signalling between TaskManager and JobManager 
> --------------------------------------------------------------------------
>
>                 Key: FLINK-25277
>                 URL: https://issues.apache.org/jira/browse/FLINK-25277
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0, 1.14.0
>            Reporter: Niklas Semmler
>            Assignee: Niklas Semmler
>            Priority: Major
>              Labels: pull-request-available, reactive
>             Fix For: 1.15.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> We need to introduce shutdown signalling between TaskManager and JobManager 
> for fast & graceful shutdown in reactive scheduler mode.
> In Flink 1.14 and earlier versions, the JobManager tracks the availability of 
> a TaskManager using a hearbeat. This heartbeat is by default configured with 
> an interval of 10 seconds and a timeout of 50 seconds [1]. Hence, the 
> shutdown of a TaskManager is recognized only after about 50-60 seconds. This 
> works fine for the static scheduling mode, where a TaskManager only 
> disappears as part of a cluster shutdown or a job failure. However, in the 
> reactive scheduler mode (FLINK-10407), TaskManagers are regularly added and 
> removed from a running job. Here, the heartbeat-mechanisms incurs additional 
> delays.
> To remove these delays, we add an explicit shutdown signal from the 
> TaskManager to the JobManager.
>  
> [1]https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to