[
https://issues.apache.org/jira/browse/FLINK-16511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrey Zagrebin closed FLINK-16511.
-----------------------------------
Resolution: Duplicate
[~mxm]
I am closing this as a duplicate of FLINK-14949, then. Please reopen it if
needed.
> Task cancellation timeout is not effective on OOM errors
> --------------------------------------------------------
>
> Key: FLINK-16511
> URL: https://issues.apache.org/jira/browse/FLINK-16511
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Reporter: Maximilian Michels
> Assignee: Maximilian Michels
> Priority: Major
>
> Under high memory pressure, the task manager shutdown on fatal errors is not
> reliable:
> If a task does not cooperate and cannot be canceled, and there is an OOM when
> starting the task cancellation watchdog thread, the exception is not
> propagated correctly. The reason for this is that the job manager retries the
> cancelTask() request multiple times. The operation is stateful: if we fail
> to start the watchdog thread, we won't attempt it again because the task has
> already switched to the CANCELING state before starting the watchdog thread.
> Such fatal errors should automatically shut down the task manager without a
> retry from the job manager side.
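
The failure mode described in the issue can be illustrated with a small, self-contained sketch. This is not Flink's actual code: SketchTask, WatchdogStarter, FatalErrorHandler and cancelWithRetries are hypothetical names invented for illustration, and the OOM is simulated by throwing OutOfMemoryError from a stub. It only shows why a retry cannot help once the state has changed, and what escalating the error instead of relying on the retry might look like.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class CancellationSketch {

    enum State { RUNNING, CANCELING }

    /** Hypothetical stand-in for whatever starts the cancellation watchdog thread. */
    interface WatchdogStarter { void start(); }

    /** Hypothetical stand-in for the task manager's fatal error handling. */
    interface FatalErrorHandler { void onFatalError(Throwable cause); }

    static final class SketchTask {
        private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private final WatchdogStarter watchdogStarter;
        private final FatalErrorHandler fatalErrorHandler; // null = just rethrow to the caller

        SketchTask(WatchdogStarter watchdogStarter, FatalErrorHandler fatalErrorHandler) {
            this.watchdogStarter = watchdogStarter;
            this.fatalErrorHandler = fatalErrorHandler;
        }

        State getState() { return state.get(); }

        void cancel() {
            // The state switches first, so only the first call gets past this check.
            if (!state.compareAndSet(State.RUNNING, State.CANCELING)) {
                return; // retries end here and never reach the watchdog start
            }
            try {
                watchdogStarter.start();
            } catch (Error e) {
                if (fatalErrorHandler != null) {
                    fatalErrorHandler.onFatalError(e); // escalate instead of relying on a retry
                } else {
                    throw e; // surfaces once to the caller; the watchdog is never started
                }
            }
        }
    }

    /** Simulates a caller that retries cancel(); returns how many calls failed. */
    static int cancelWithRetries(SketchTask task, int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                task.cancel();
            } catch (Error e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        AtomicInteger startAttempts = new AtomicInteger();
        WatchdogStarter failing = () -> {
            startAttempts.incrementAndGet();
            throw new OutOfMemoryError("simulated: cannot create watchdog thread");
        };

        // Without escalation: three cancel calls, but only one watchdog start attempt.
        SketchTask task = new SketchTask(failing, null);
        int failures = cancelWithRetries(task, 3);
        System.out.println("failures seen by caller: " + failures);
        System.out.println("watchdog start attempts: " + startAttempts.get());
        System.out.println("state: " + task.getState());

        // With escalation: the first failure is reported as fatal right away.
        AtomicInteger fatalReports = new AtomicInteger();
        SketchTask escalating = new SketchTask(failing, cause -> fatalReports.incrementAndGet());
        cancelWithRetries(escalating, 3);
        System.out.println("fatal reports: " + fatalReports.get());
    }
}
```

In the first run the task ends up in CANCELING with no watchdog, and the second and third calls return without doing anything. In the second run the handler is invoked on the first failure, which corresponds to the behavior the issue asks for.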
--
This message was sent by Atlassian Jira
(v8.3.4#803005)