[
https://issues.apache.org/jira/browse/FLINK-14949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrey Zagrebin updated FLINK-14949:
------------------------------------
Component/s: (was: Runtime / Coordination)
Runtime / Task
> Task cancellation can be stuck against out-of-thread error
> ----------------------------------------------------------
>
> Key: FLINK-14949
> URL: https://issues.apache.org/jira/browse/FLINK-14949
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.8.2
> Reporter: Hwanju Kim
> Priority: Major
>
> Task cancellation
> ([_cancelOrFailAndCancelInvokable_|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L991])
> relies on three separate threads: _TaskCanceler_, _TaskInterrupter_, and
> _TaskCancelerWatchdog_. TaskCanceler performs the cancellation itself,
> TaskInterrupter periodically interrupts a task that does not react, and
> TaskCancelerWatchdog kills the JVM if cancellation has not finished within
> a certain amount of time (3 min by default). Together they ensure that
> cancellation either completes or is aborted, so that the task reaches a
> terminal state in finite time (FLINK-4715).
> However, if the creation of any of these asynchronous threads fails, for
> example because the process is out of threads (_java.lang.OutOfMemoryError:
> unable to create new native thread_), the task transitions to CANCELING,
> but no cancellation is performed and no watchdog monitors it. Currently,
> the jobmanager does [retry
> cancellation|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L1121]
> on any error returned, but the next retry [returns success once it sees
> CANCELING|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L997],
> assuming that cancellation is in progress. The task therefore stays in
> CANCELING, which is non-terminal, and the state machine never leaves it.
> One solution: if a task has transitioned to CANCELING but then gets a
> fatal error or OOM (i.e., _isJvmFatalOrOutOfMemoryError_ is true),
> indicating that it could not get as far as spawning TaskCancelerWatchdog,
> it could immediately treat that as a fatal error (not safely cancellable)
> and call _notifyFatalError_, just as TaskCancelerWatchdog does but eagerly
> and synchronously. That way, it can at least leave the non-terminal state
> and, by restarting the JVM, also clear potentially leaked threads and
> memory. The same method is also invoked by _failExternally_, but the
> transition to FAILED seems less critical because FAILED is already a
> terminal state.
> The problem is straightforward to reproduce: run an application that keeps
> creating threads in a loop, none of which ever finishes, and that has
> multiple tasks, so that one task triggers a failure and the others are then
> cancelled by a full fail-over. In the web UI dashboard, tasks on a task
> manager where any of the cancellation-related threads failed to spawn stay
> stuck in CANCELING for good.
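
The fix proposed in the description can be sketched roughly as follows. This is a simplified, self-contained illustration and not Flink's actual Task code: the CancelSketch class, the ThreadStarter interface, and the fatalErrorHandler callback are hypothetical names introduced here, and the local isJvmFatalOrOutOfMemoryError helper only approximates the check named in the report. The idea it shows is that a failure to spawn the cancellation threads is reported as a fatal error right away instead of leaving the task in CANCELING with nothing watching it.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Simplified, hypothetical stand-in for the cancellation path; not Flink's Task class. */
class CancelSketch {

    enum State { RUNNING, CANCELING, CANCELED }

    /** Hypothetical seam so that a thread-creation failure can be simulated. */
    interface ThreadStarter {
        void start(Runnable body, String name);
    }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private final ThreadStarter starter;
    // Plays the role of notifyFatalError in the report (assumption: it ends the process).
    private final Consumer<Throwable> fatalErrorHandler;

    CancelSketch(ThreadStarter starter, Consumer<Throwable> fatalErrorHandler) {
        this.starter = starter;
        this.fatalErrorHandler = fatalErrorHandler;
    }

    State currentState() {
        return state.get();
    }

    /** Rough approximation of the check named in the report, not Flink's implementation. */
    static boolean isJvmFatalOrOutOfMemoryError(Throwable t) {
        return t instanceof OutOfMemoryError
                || t instanceof InternalError
                || t instanceof UnknownError;
    }

    void cancel() {
        if (!state.compareAndSet(State.RUNNING, State.CANCELING)) {
            return; // already CANCELING or terminal, so a retry simply reports success
        }
        try {
            // Stand-ins for spawning the canceler and the watchdog threads.
            starter.start(() -> state.set(State.CANCELED), "canceler");
            starter.start(() -> { }, "watchdog");
        } catch (Error | RuntimeException t) {
            if (isJvmFatalOrOutOfMemoryError(t)) {
                // Eager, synchronous escalation: nothing else will ever
                // move this task out of CANCELING.
                fatalErrorHandler.accept(t);
            } else {
                throw t;
            }
        }
    }
}
```

With a starter such as (body, name) -> new Thread(body, name).start(), a failed thread creation surfaces as an OutOfMemoryError thrown from start(), which the catch block hands to the handler. Whether the real fix should escalate on every such error or only when the watchdog itself cannot be started is a design question this sketch does not settle.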