[jira] [Commented] (FLINK-16511) Task cancellation timeout is not effective on OOM errors

Maximilian Michels (Jira) Tue, 10 Mar 2020 10:45:24 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-16511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056181#comment-17056181
 ]


Maximilian Michels commented on FLINK-16511:
--------------------------------------------

Awesome. Thanks for pointing me to the issue; I couldn't find it when I 
searched before opening this one. The solution looks exactly like how I 
worked around this.

> Task cancellation timeout is not effective on OOM errors
> --------------------------------------------------------
>
>                 Key: FLINK-16511
>                 URL: https://issues.apache.org/jira/browse/FLINK-16511
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>            Reporter: Maximilian Michels
>            Assignee: Maximilian Michels
>            Priority: Major
>
> Under high memory pressure, the task manager shutdown on fatal errors is not 
> reliable:
> If a task does not cooperate and cannot be canceled, and there is an OOM when 
> starting the task cancellation watchdog thread, the exception is not 
> propagated correctly. The reason for this is that the job manager retries the 
> cancelTask() request multiple times. The operation is stateful, and if we fail 
> to start the watchdog thread, we won't attempt it again because the task has 
> already switched to the CANCELING state before starting the watchdog thread.
> Such fatal errors should automatically shut down the task manager without a 
> retry from the job manager side.
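
The failure mode in the quoted description can be illustrated with a small, 
self-contained model. This is only a sketch under stated assumptions, not 
Flink's actual code: CancelSketch, Task, WatchdogStarter, cancelUnsafe() and 
cancelFailFast() are hypothetical names, and the OOM is simulated by throwing 
OutOfMemoryError by hand. It shows why retrying a stateful cancel request does 
not help once the state has already moved to CANCELING, and what escalating 
the error instead might look like.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of the issue; all names here are hypothetical.
final class CancelSketch {

    enum State { RUNNING, CANCELING }

    /** Stand-in for "start the cancellation watchdog thread". */
    interface WatchdogStarter {
        void start();
    }

    static final class Task {
        final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        int watchdogStartAttempts = 0;
        boolean fatalErrorReported = false;

        // Problematic ordering: the state flips to CANCELING first, so a
        // retried cancel request is a no-op even if the watchdog never started.
        void cancelUnsafe(WatchdogStarter starter) {
            if (state.compareAndSet(State.RUNNING, State.CANCELING)) {
                watchdogStartAttempts++;
                starter.start(); // may throw OutOfMemoryError
            }
        }

        // Fail-fast variant: a failure to start the watchdog is treated as
        // fatal right away instead of relying on the caller to retry.
        void cancelFailFast(WatchdogStarter starter) {
            if (state.compareAndSet(State.RUNNING, State.CANCELING)) {
                watchdogStartAttempts++;
                try {
                    starter.start();
                } catch (OutOfMemoryError e) {
                    // A real system would shut the process down here.
                    fatalErrorReported = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        WatchdogStarter failing = () -> {
            throw new OutOfMemoryError("simulated");
        };

        // The caller retries three times, as the job manager would.
        Task unsafe = new Task();
        for (int i = 0; i < 3; i++) {
            try {
                unsafe.cancelUnsafe(failing);
            } catch (OutOfMemoryError e) {
                // Only the first attempt reaches the watchdog start.
            }
        }
        System.out.println("unsafe: attempts=" + unsafe.watchdogStartAttempts
                + " fatalReported=" + unsafe.fatalErrorReported);

        Task failFast = new Task();
        failFast.cancelFailFast(failing);
        System.out.println("failFast: attempts=" + failFast.watchdogStartAttempts
                + " fatalReported=" + failFast.fatalErrorReported);
    }
}
```

In the first variant the two retries never reach the watchdog start, so the 
task stays in CANCELING with no watchdog and no fatal error reported. The 
second variant reports the failure on the first attempt, which is the 
behavior the description asks for.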



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
