ASF GitHub Bot commented on FLINK-4715:

GitHub user uce opened a pull request:


    [FLINK-4715] Fail TaskManager with fatal error if task cancellation is stuck

    - Splits the cancellation up into two threads:
        * The `TaskCanceler` calls `cancel` on the invokable and `interrupt` on
          the executing Thread. It then exits.
        * The `TaskCancellationWatchDog` kicks in after the task cancellation
          interval (current default: 30 secs) and periodically calls `interrupt`
          on the executing Thread. If the Thread does not terminate within the
          task cancellation timeout (new config value, default 3 mins), the task
          manager is notified about a fatal error, leading to termination of
          the JVM.
    - The new configuration is exposed via 
      (default: 3 mins) and the `ExecutionConfig` (similar to the cancellation 
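
For illustration, the two-thread scheme described above could be sketched roughly as follows. This is not the code from the pull request: the class names `CancellationSketch`, `Canceler`, `Watchdog`, the `FatalErrorHandler` callback and the `simulate` helper are all hypothetical, and the short interval and timeout values are chosen only to keep the demo fast. The sketch assumes the timeout is measured from the moment the watchdog starts.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class CancellationSketch {

    /** Hypothetical stand-in for notifying the task manager about a fatal error. */
    interface FatalErrorHandler {
        void onFatalError(String message);
    }

    /** Calls cancel on the task, interrupts the executing thread once, then exits. */
    static final class Canceler implements Runnable {
        private final Runnable cancelAction;
        private final Thread executor;

        Canceler(Runnable cancelAction, Thread executor) {
            this.cancelAction = cancelAction;
            this.executor = executor;
        }

        @Override
        public void run() {
            try {
                cancelAction.run();
            } finally {
                executor.interrupt();
            }
        }
    }

    /** Re-interrupts the executing thread every interval; reports a fatal error on timeout. */
    static final class Watchdog implements Runnable {
        private final Thread executor;
        private final long intervalMillis;
        private final long timeoutMillis;
        private final FatalErrorHandler handler;

        Watchdog(Thread executor, long intervalMillis, long timeoutMillis, FatalErrorHandler handler) {
            if (intervalMillis <= 0 || timeoutMillis <= 0) {
                // join(0) would wait forever, so both values must be positive
                throw new IllegalArgumentException("interval and timeout must be positive");
            }
            this.executor = executor;
            this.intervalMillis = intervalMillis;
            this.timeoutMillis = timeoutMillis;
            this.handler = handler;
        }

        @Override
        public void run() {
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            try {
                // Give the first interrupt (sent by the canceler) one interval to take effect.
                executor.join(intervalMillis);
                while (executor.isAlive()) {
                    long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                    if (remainingMillis <= 0) {
                        handler.onFatalError("Task did not exit within " + timeoutMillis + " ms");
                        return;
                    }
                    executor.interrupt();
                    executor.join(Math.min(intervalMillis, remainingMillis));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Runs both steps against a simulated task; returns true if the fatal handler fired. */
    static boolean simulate(boolean ignoresInterrupts) {
        AtomicBoolean released = new AtomicBoolean(false);
        AtomicBoolean fatal = new AtomicBoolean(false);
        Thread task = new Thread(() -> {
            while (!released.get()) {
                try {
                    Thread.sleep(1_000);
                } catch (InterruptedException e) {
                    if (!ignoresInterrupts) {
                        return; // well-behaved task: stop on interrupt
                    }
                }
            }
        });
        task.setDaemon(true);
        task.start();
        // Both steps run synchronously here for simplicity; the description uses two threads.
        new Canceler(() -> { }, task).run();
        new Watchdog(task, 20, 200, msg -> fatal.set(true)).run();
        released.set(true); // let the simulated stuck task finish after the demo
        task.interrupt();
        return fatal.get();
    }
}
```

In this sketch `simulate(false)` models a task that reacts to the interrupt, so the handler is never invoked, while `simulate(true)` models user code that swallows interrupts, so the handler fires once the timeout has elapsed. A real handler would terminate the JVM instead of setting a flag.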

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/flink 4715-suicide

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2652


> TaskManager should commit suicide after cancellation failure
> ------------------------------------------------------------
>                 Key: FLINK-4715
>                 URL: https://issues.apache.org/jira/browse/FLINK-4715
>             Project: Flink
>          Issue Type: Improvement
>          Components: TaskManager
>    Affects Versions: 1.2.0
>            Reporter: Till Rohrmann
>            Assignee: Ufuk Celebi
>             Fix For: 1.2.0
> In case of a failed cancellation, e.g. the task cannot be cancelled after a 
> given time, the {{TaskManager}} should kill itself. That way we guarantee 
> that there is no resource leak. 
> This behaviour acts as a safety-net against faulty user code.

This message was sent by Atlassian JIRA
