[ 
https://issues.apache.org/jira/browse/FLINK-14949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982301#comment-16982301
 ] 

Andrey Zagrebin commented on FLINK-14949:
-----------------------------------------

[~hwanju]
Thanks for reporting this problem. It looks like a bug in Flink. The proposed 
solution also looks good: if we are unable to spawn the cancellation threads, 
there is not much we can do except treat it as a fatal error and terminate the 
JVM. After talking to [~pnowojski], we have no plans to handle cancellation 
differently at the moment, so we need to add another try/catch around the code 
that spawns the cancellation threads.
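
A minimal sketch of what such a try/catch could look like. This is not the actual Task.java code: the class CancellationSketch, the ThreadSpawner interface, and the fatal-error callback are hypothetical stand-ins, and catching VirtualMachineError is only a rough approximation of the _isJvmFatalOrOutOfMemoryError_ check.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

public class CancellationSketch {
    public enum State { RUNNING, CANCELING, CANCELED }

    /** Hypothetical stand-in for starting the canceler, interrupter and watchdog threads. */
    public interface ThreadSpawner {
        void spawn(Runnable work);
    }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private final ThreadSpawner spawner;
    private final Consumer<Throwable> fatalErrorHandler;

    public CancellationSketch(ThreadSpawner spawner, Consumer<Throwable> fatalErrorHandler) {
        this.spawner = spawner;
        this.fatalErrorHandler = fatalErrorHandler;
    }

    public State state() {
        return state.get();
    }

    public void cancel() {
        // Only the first caller moves RUNNING -> CANCELING; later callers assume
        // cancellation is already in progress and return.
        if (!state.compareAndSet(State.RUNNING, State.CANCELING)) {
            return;
        }
        try {
            spawner.spawn(() -> state.set(State.CANCELED));
        } catch (VirtualMachineError e) {
            // OutOfMemoryError is a VirtualMachineError. Nothing will cancel or watch
            // this task now, so report a fatal error eagerly and synchronously instead
            // of staying in CANCELING forever.
            fatalErrorHandler.accept(e);
        }
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> fatal = new AtomicReference<>();
        // Simulate the failure mode from the report: thread creation itself throws.
        ThreadSpawner failing = work -> {
            throw new OutOfMemoryError("unable to create new native thread");
        };
        CancellationSketch task = new CancellationSketch(failing, fatal::set);
        task.cancel();
        System.out.println("state=" + task.state() + ", fatal=" + fatal.get());
    }
}
```

In the real code the callback would presumably be the existing _notifyFatalError_ path, which ends with the JVM being terminated; here it only records the error so the sketch can run.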

Do you have time to work on the suggested fix, and would you like to be 
assigned to the issue?

> Task cancellation can be stuck against out-of-thread error
> ----------------------------------------------------------
>
>                 Key: FLINK-14949
>                 URL: https://issues.apache.org/jira/browse/FLINK-14949
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.8.2
>            Reporter: Hwanju Kim
>            Priority: Major
>
> Task cancellation 
> ([_cancelOrFailAndCancelInvokable_|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L991])
>  relies on multiple separate threads: _TaskCanceler_, 
> _TaskInterrupter_, and _TaskCancelerWatchdog_. TaskCanceler performs the 
> cancellation itself, TaskInterrupter periodically interrupts a task that 
> does not react, and TaskCancelerWatchdog kills the JVM if cancellation has 
> not finished within a certain amount of time (3 minutes by default). Together 
> they ensure that cancellation either completes or is aborted, so the task 
> reaches a terminal state in finite time (FLINK-4715).
> However, if creating any of these asynchronous threads fails, for example 
> because the process has run out of threads (_java.lang.OutOfMemoryError: 
> unable to create new native thread_), the task transitions to CANCELING, but 
> nothing performs the cancellation and no watchdog monitors it. Currently, the 
> jobmanager does [retry 
> cancellation|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L1121]
>  on any error returned, but the next retry [returns success once it sees 
> CANCELING|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L997],
>  assuming that cancellation is in progress. The task is therefore stuck in 
> CANCELING, which is non-terminal, and the state machine makes no further 
> progress.
> One solution: if a task has transitioned to CANCELING but then hits a fatal 
> error or OOM (i.e., _isJvmFatalOrOutOfMemoryError_ is true), indicating that 
> it could not get as far as spawning TaskCancelerWatchdog, it could 
> immediately treat that as a fatal error (not safely cancellable) and call 
> _notifyFatalError_, just as TaskCancelerWatchdog does, but eagerly and 
> synchronously. That way, it can at least leave the non-terminal state and, 
> by restarting the JVM, also clean up potentially leaked threads and memory. 
> The same method is also invoked by _failExternally_, but the transition to 
> FAILED seems less critical because FAILED is already a terminal state.
> Reproducing this is straightforward: run an application that keeps creating 
> threads in a loop, none of which ever finishes, and that has multiple tasks, 
> so that one task triggers a failure and the others are then cancelled by a 
> full failover. In the web UI dashboard, tasks on a task manager where any of 
> the cancellation-related threads failed to spawn remain stuck in CANCELING 
> indefinitely.
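
The thread-exhaustion part of the reproduction described above could be sketched roughly as follows. This is an illustration only, not the reporter's application: ThreadLeakSketch and leakThreads are hypothetical names, and the loop is capped by maxThreads and releasable through a latch so that it can be run safely. In an actual reproduction the loop would be unbounded and would run inside a Flink task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class ThreadLeakSketch {

    /**
     * Starts up to maxThreads daemon threads that block until the latch is released.
     * Stops early if the JVM can no longer create native threads.
     */
    public static List<Thread> leakThreads(int maxThreads, CountDownLatch release) {
        List<Thread> started = new ArrayList<>();
        try {
            for (int i = 0; i < maxThreads; i++) {
                Thread t = new Thread(() -> {
                    try {
                        release.await(); // never finishes unless the latch is released
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }, "leaked-" + i);
                t.setDaemon(true);
                t.start();
                started.add(t);
            }
        } catch (OutOfMemoryError e) {
            // Typically "unable to create new native thread": the thread limit was hit.
            System.err.println("Thread creation failed after " + started.size() + " threads: " + e);
        }
        return started;
    }

    public static void main(String[] args) {
        CountDownLatch release = new CountDownLatch(1);
        List<Thread> threads = leakThreads(8, release);
        System.out.println("Blocked threads started: " + threads.size());
        release.countDown(); // let the threads exit so the demo terminates cleanly
    }
}
```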



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
