[ https://issues.apache.org/jira/browse/FLINK-16511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056181#comment-17056181 ]
Maximilian Michels commented on FLINK-16511:
--------------------------------------------
Awesome. Thanks for pointing me to the issue; I couldn't find it when I
searched before opening this one. The solution looks exactly like how I
worked around this.
> Task cancellation timeout is not effective on OOM errors
> --------------------------------------------------------
>
> Key: FLINK-16511
> URL: https://issues.apache.org/jira/browse/FLINK-16511
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Reporter: Maximilian Michels
> Assignee: Maximilian Michels
> Priority: Major
>
> Under high memory pressure, the task manager shutdown on fatal errors is not
> reliable:
> If a task does not cooperate and cannot be canceled, and there is an OOM when
> starting the task cancellation watchdog thread, the exception is not
> propagated correctly. The reason for this is that the job manager retries the
> cancelTask() request multiple times. The operation is stateful, and if we fail
> to start the watchdog thread, we won't attempt it again because the task has
> already switched to the CANCELING state before starting the watchdog thread.
> Such fatal errors should automatically shut down the task manager without a
> retry from the job manager side.
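
The failure mode described above can be sketched as follows. This is a
minimal, hypothetical model and not Flink's actual Task code: the class
CancelSketch, the WatchdogStarter interface, and the fatalErrorHandler
callback are all names assumed for illustration. cancelUnsafe() shows why a
retried cancel request becomes a no-op once the state is CANCELING, and
cancelSafe() shows one possible way to escalate the error instead.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Hypothetical, simplified model of the cancellation path (not Flink's Task class). */
final class CancelSketch {
    enum State { RUNNING, CANCELING }

    /** Hypothetical stand-in for whatever starts the cancellation watchdog thread. */
    interface WatchdogStarter { void start(); }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private final WatchdogStarter watchdogStarter;
    private final Consumer<Throwable> fatalErrorHandler;

    CancelSketch(WatchdogStarter watchdogStarter, Consumer<Throwable> fatalErrorHandler) {
        this.watchdogStarter = watchdogStarter;
        this.fatalErrorHandler = fatalErrorHandler;
    }

    /** Problematic shape: the state switches first, then the watchdog is started. */
    void cancelUnsafe() {
        if (state.compareAndSet(State.RUNNING, State.CANCELING)) {
            // May throw OutOfMemoryError. The state is already CANCELING, so a
            // retried call skips this branch and never starts the watchdog.
            watchdogStarter.start();
        }
    }

    /** One possible fix: treat a failed watchdog start as a fatal error. */
    void cancelSafe() {
        if (state.compareAndSet(State.RUNNING, State.CANCELING)) {
            try {
                watchdogStarter.start();
            } catch (Throwable t) {
                // Assumption: the handler shuts the process down rather than
                // relying on another cancel request from the job manager.
                fatalErrorHandler.accept(t);
            }
        }
    }

    State getState() {
        return state.get();
    }
}
```

With a starter that always throws, cancelUnsafe() attempts the watchdog start
exactly once no matter how often it is retried, while cancelSafe() reports the
error to the handler on that first attempt.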