Hwanju Kim created FLINK-14949:
----------------------------------
Summary: Task cancellation can be stuck against out-of-thread error
Key: FLINK-14949
URL: https://issues.apache.org/jira/browse/FLINK-14949
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.8.2
Reporter: Hwanju Kim
Task cancellation
([_cancelOrFailAndCancelInvokable_|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L991])
relies on three separate threads: _TaskCanceler_, _TaskInterrupter_, and
_TaskCancelerWatchdog_. TaskCanceler performs the cancellation itself,
TaskInterrupter periodically interrupts a task that does not react, and
TaskCancelerWatchdog kills the JVM if cancellation has not finished within a
certain amount of time (3 min by default). Together they ensure that
cancellation either completes or is aborted, so that the task reaches a
terminal state in finite time (FLINK-4715).
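The division of labour between the three threads can be sketched roughly as follows. This is a simplified illustration and not the code in Task.java: the class _CancelThreadsSketch_, the method _startCancellation_, and all of its parameters are hypothetical names, and the fatal-error callback stands in for killing the JVM.

```java
// Simplified sketch of the three-thread arrangement; NOT Flink's Task.java.
class CancelThreadsSketch {

    /**
     * Starts canceler, interrupter, and watchdog threads for the given task thread.
     * intervalMs and timeoutMs must be > 0, because Thread.join(0) waits forever.
     */
    static Thread[] startCancellation(
            Thread task, Runnable cancelAction, long intervalMs, long timeoutMs, Runnable onFatal) {

        // Performs the cancellation itself, then sends a first interrupt.
        Thread canceler = new Thread(() -> {
            cancelAction.run();
            task.interrupt();
        }, "canceler");

        // Re-interrupts the task every intervalMs for as long as it is alive.
        Thread interrupter = new Thread(() -> {
            try {
                task.join(intervalMs);
                while (task.isAlive()) {
                    task.interrupt();
                    task.join(intervalMs);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "interrupter");

        // Escalates if the task is still alive after timeoutMs.
        Thread watchdog = new Thread(() -> {
            try {
                task.join(timeoutMs);
                if (task.isAlive()) {
                    onFatal.run(); // the real watchdog would bring down the JVM here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "watchdog");

        Thread[] helpers = {canceler, interrupter, watchdog};
        for (Thread helper : helpers) {
            helper.setDaemon(true);
            // Each start() can throw OutOfMemoryError when no native thread is available.
            helper.start();
        }
        return helpers;
    }
}
```

Note that every one of the three _start()_ calls is a point where thread creation can fail, which is the situation this issue is about.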
However, if creating any of these asynchronous threads fails, for example
because the process has run out of threads (_java.lang.OutOfMemoryError: unable
to create new native thread_), the task has already transitioned to CANCELING,
but nothing performs the cancellation and no watchdog monitors it.
Currently, the jobmanager does [retry
cancellation|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L1121]
when an error is returned, but the next retry [returns success once it sees
CANCELING|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L997],
assuming that cancellation is in progress. The task therefore stays in
CANCELING, which is a non-terminal state, and the state machine makes no
further progress.
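The failure sequence can be modelled with a small state machine. The sketch below is a hypothetical stand-in for the task, not Flink code: _StuckCancelSketch_ and its _Spawner_ hook are invented for illustration, and the hook replaces real thread creation so that the failure can be injected.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified, hypothetical model of the cancellation path; NOT Flink's Task class.
class StuckCancelSketch {
    enum State { RUNNING, CANCELING, CANCELED }

    /** Hypothetical hook standing in for starting the canceler threads. */
    interface Spawner {
        void spawn(Runnable work);
    }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private final Spawner spawner;

    StuckCancelSketch(Spawner spawner) {
        this.spawner = spawner;
    }

    State state() {
        return state.get();
    }

    /** Returns normally on "success"; propagates the error if spawning fails. */
    void cancel() {
        State current = state.get();
        if (current == State.CANCELING || current == State.CANCELED) {
            // A retry lands here: CANCELING is taken to mean "already in progress".
            return;
        }
        if (state.compareAndSet(State.RUNNING, State.CANCELING)) {
            // If spawn() throws, the state stays CANCELING and nothing drives it further.
            spawner.spawn(() -> state.set(State.CANCELED));
        }
    }
}
```

With a spawner that throws _OutOfMemoryError_, the first _cancel()_ call fails, the retry returns normally, and the state remains CANCELING. With a spawner that simply runs the work, the state reaches CANCELED.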
One solution: if a task has transitioned to CANCELING but then hits a fatal
error or OOM (i.e., _isJvmFatalOrOutOfMemoryError_ is true), which indicates
that it may not have managed to spawn TaskCancelerWatchdog, it could immediately
treat this as a fatal error (not safely cancellable) and call
_notifyFatalError_, just as TaskCancelerWatchdog does, but eagerly and
synchronously. That way, the task at least leaves the non-terminal state, and
restarting the JVM also clears any leaked threads or memory.
The same method is also invoked by _failExternally_, but the transition to
FAILED seems less critical because FAILED is already a terminal state.
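A minimal sketch of this idea, under stated assumptions: _EagerFatalCancelSketch_, _Spawner_, and _FatalErrorHandler_ are hypothetical names, and the local _isJvmFatalOrOutOfMemoryError_ is a simplified stand-in for the Flink utility of the same name, not a copy of it. It is not a patch against Task.java.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the proposed eager escalation; NOT a patch for Task.java.
class EagerFatalCancelSketch {
    enum State { RUNNING, CANCELING }

    /** Hypothetical hook standing in for starting the canceler threads. */
    interface Spawner {
        void spawn(Runnable work);
    }

    /** Hypothetical stand-in for the component that receives notifyFatalError. */
    interface FatalErrorHandler {
        void notifyFatalError(String message, Throwable cause);
    }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private final Spawner spawner;
    private final FatalErrorHandler handler;

    EagerFatalCancelSketch(Spawner spawner, FatalErrorHandler handler) {
        this.spawner = spawner;
        this.handler = handler;
    }

    /** Simplified stand-in for the check named in the issue (assumption). */
    static boolean isJvmFatalOrOutOfMemoryError(Throwable t) {
        return t instanceof OutOfMemoryError
                || t instanceof InternalError
                || t instanceof UnknownError;
    }

    void cancel() {
        if (!state.compareAndSet(State.RUNNING, State.CANCELING)) {
            return; // already canceling
        }
        try {
            spawner.spawn(() -> { /* cancellation work would run here */ });
        } catch (Throwable t) {
            if (isJvmFatalOrOutOfMemoryError(t)) {
                // Escalate synchronously instead of relying on a watchdog
                // thread that may never have been started.
                handler.notifyFatalError("Could not start task cancellation", t);
                return;
            }
            throw t; // non-fatal errors still reach the caller, which may retry
        }
    }
}
```

The escalation runs on the calling thread, so it does not depend on being able to create another thread.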
To reproduce, run an application whose tasks keep creating threads in a loop,
none of which ever finishes, and which has multiple tasks, so that one task
triggers a failure and the others are then cancelled by full fail-over. In the
web UI dashboard, some tasks on a task manager where any of the
cancellation-related threads failed to spawn remain stuck in CANCELING
indefinitely.
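The thread-leaking part of such an application could look like the sketch below. The class _ThreadLeakSketch_ and the _maxThreads_ cap are hypothetical additions so that the snippet can be tried safely. An actual reproduction would call it from user code inside a task with a very large cap, so that the process runs out of native threads.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical helper that leaks threads; meant to be called from a job's user code.
class ThreadLeakSketch {
    // Never counted down, so every leaked thread blocks on it indefinitely.
    private static final CountDownLatch NEVER_RELEASED = new CountDownLatch(1);

    /** Creates up to maxThreads blocked threads; returns how many were started. */
    static int leakThreads(int maxThreads) {
        int created = 0;
        try {
            while (created < maxThreads) {
                Thread leaked = new Thread(() -> {
                    try {
                        NEVER_RELEASED.await();
                    } catch (InterruptedException e) {
                        // An interrupt is the only way out; exit quietly.
                    }
                }, "leaked-" + created);
                leaked.setDaemon(true); // daemon, so the JVM can still exit
                leaked.start();
                created++;
            }
        } catch (OutOfMemoryError e) {
            // "unable to create new native thread": the thread limit has been reached.
        }
        return created;
    }
}
```

For example, _ThreadLeakSketch.leakThreads(Integer.MAX_VALUE)_ keeps going until thread creation fails, after which any further attempt to start a thread in the same JVM, including the cancellation threads, is likely to fail as well.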
--
This message was sent by Atlassian Jira
(v8.3.4#803005)