[jira] [Updated] (FLINK-14949) Task cancellation can be stuck against out-of-thread error

Andrey Zagrebin (Jira) Tue, 26 Nov 2019 01:34:57 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-14949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andrey Zagrebin updated FLINK-14949:
------------------------------------
    Component/s:     (was: Runtime / Coordination)
                 Runtime / Task

> Task cancellation can be stuck against out-of-thread error
> ----------------------------------------------------------
>
>                 Key: FLINK-14949
>                 URL: https://issues.apache.org/jira/browse/FLINK-14949
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.8.2
>            Reporter: Hwanju Kim
>            Priority: Major
>
> Task cancellation 
> ([_cancelOrFailAndCancelInvokable_|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L991])
>  relies on several separate threads: _TaskCanceler_, _TaskInterrupter_, and 
> _TaskCancelerWatchdog_. TaskCanceler performs the cancellation itself, 
> TaskInterrupter periodically interrupts a task that does not react, and 
> TaskCancelerWatchdog kills the JVM if cancellation has not finished within a 
> certain amount of time (3 minutes by default). Together they ensure that 
> cancellation either completes or is aborted, so that a terminal state is 
> reached in finite time (FLINK-4715).
> However, if creating any of these asynchronous threads fails, for example 
> because the process has run out of threads (_java.lang.OutOfMemoryError: 
> unable to create new native thread_), the task transitions to CANCELING, but 
> no cancellation is performed and no watchdog monitors it. Currently, the 
> jobmanager does [retry 
> cancellation|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L1121]
>  on any returned error, but the next retry [returns success once it sees 
> CANCELING|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L997],
>  assuming that cancellation is in progress. The task therefore stays in 
> CANCELING, which is non-terminal, and the state machine makes no further 
> progress.
> One solution: if a task has transitioned to CANCELING but then hits a fatal 
> error or OOM (i.e., _isJvmFatalOrOutOfMemoryError_ is true), indicating that 
> it could not get as far as spawning TaskCancelerWatchdog, it could 
> immediately treat this as a fatal error (not safely cancellable) and call 
> _notifyFatalError_, just as TaskCancelerWatchdog does, but eagerly and 
> synchronously. That way, it can at least leave the non-terminal state and, 
> by restarting the JVM, also clear potentially leaked threads and memory. The 
> same method is also invoked by _failExternally_, but the transition to 
> FAILED seems less critical because FAILED is already a terminal state.
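
The proposal above can be illustrated with a small, self-contained Java sketch. It does not use Flink's classes: CancelSketch, ThreadStarter, the three-value State enum, and the local isFatalOrOutOfMemory check are hypothetical stand-ins for Task, thread creation, the task state machine, and _isJvmFatalOrOutOfMemoryError_, and the fatalErrorHandler callback plays the role of _notifyFatalError_. It only shows the idea of escalating synchronously when a cancellation helper thread cannot be started.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Simplified, hypothetical model of the cancellation path. This is not Flink's Task class. */
public class CancelSketch {

    public enum State { RUNNING, CANCELING, CANCELED }

    /** Hypothetical stand-in for starting a helper thread; may throw OutOfMemoryError. */
    public interface ThreadStarter {
        void start(Runnable body, String name);
    }

    public static final class Task {
        private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private final ThreadStarter starter;
        // Plays the role of notifyFatalError in this sketch.
        private final Consumer<Throwable> fatalErrorHandler;

        public Task(ThreadStarter starter, Consumer<Throwable> fatalErrorHandler) {
            this.starter = starter;
            this.fatalErrorHandler = fatalErrorHandler;
        }

        public State state() {
            return state.get();
        }

        public void cancel() {
            if (!state.compareAndSet(State.RUNNING, State.CANCELING)) {
                // Already CANCELING or terminal: a retry returns normally,
                // which is why a retry alone cannot unblock a stuck task.
                return;
            }
            try {
                // A real task would also start the interrupter and the watchdog here.
                starter.start(() -> state.set(State.CANCELED), "canceler");
            } catch (Error e) {
                if (isFatalOrOutOfMemory(e)) {
                    // Proposed behavior: escalate eagerly and synchronously
                    // instead of silently staying in CANCELING.
                    fatalErrorHandler.accept(e);
                } else {
                    throw e;
                }
            }
        }

        // Local approximation of a "fatal or OOM" check; an assumption of this sketch.
        private static boolean isFatalOrOutOfMemory(Throwable t) {
            return t instanceof OutOfMemoryError
                    || t instanceof InternalError
                    || t instanceof UnknownError;
        }
    }

    public static void main(String[] args) {
        List<Throwable> reported = new ArrayList<>();
        Task task = new Task(
                (body, name) -> {
                    throw new OutOfMemoryError("unable to create new native thread");
                },
                reported::add);
        task.cancel();
        System.out.println(task.state() + ", fatal errors reported: " + reported.size());
    }
}
```

In this model the task is still in CANCELING after the failed attempt, but the handler has been invoked once, so the process can be restarted instead of waiting on a watchdog that never started.
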
> Reproducing this is straightforward: run an application that keeps creating 
> threads in a loop, none of which ever finishes, and that has multiple tasks, 
> so that one task triggers a failure and the others are then cancelled by the 
> full fail-over. In the web UI dashboard, some tasks on a task manager where 
> any of the cancellation-related threads failed to be spawned stay stuck in 
> CANCELING for good.
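
The thread-exhausting part of the reproduction might look like the sketch below. It is an assumption about how such a loop could be written, not the reporter's actual application: ThreadLeakSketch and leakThreads are hypothetical names, and the loop is bounded by maxThreads and releasable through a latch so that it can be tried safely outside a cluster. Inside a Flink user function, a very large bound with a latch that is never released would approximate the "never finishes" behavior described above.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical helper that leaks threads until a limit or an OutOfMemoryError is hit. */
public class ThreadLeakSketch {

    /**
     * Starts up to maxThreads daemon threads that block until the latch is released.
     * Returns the number of threads that were actually started.
     */
    public static int leakThreads(int maxThreads, CountDownLatch release) {
        int started = 0;
        try {
            for (; started < maxThreads; started++) {
                Thread t = new Thread(() -> {
                    try {
                        release.await(); // parks until released; never, if nobody counts down
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }, "leaked-" + started);
                t.setDaemon(true);
                t.start();
            }
        } catch (OutOfMemoryError e) {
            // Typically "unable to create new native thread": the thread limit was reached.
        }
        return started;
    }

    public static void main(String[] args) {
        CountDownLatch release = new CountDownLatch(1);
        int started = leakThreads(8, release); // small bound for a safe local run
        System.out.println("started " + started + " blocked threads");
        release.countDown(); // let the threads finish
    }
}
```
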



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
