[jira] [Commented] (FLINK-16511) Task cancellation timeout is not effective on OOM errors

Andrey Zagrebin (Jira) Tue, 10 Mar 2020 10:29:23 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-16511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056171#comment-17056171
 ]


Andrey Zagrebin commented on FLINK-16511:
-----------------------------------------

[~mxm] 
 This should have been fixed for 1.9 and 1.10 (FLINK-14949). According to that 
issue, porting to 1.8 requires more effort.

> Task cancellation timeout is not effective on OOM errors
> --------------------------------------------------------
>
>                 Key: FLINK-16511
>                 URL: https://issues.apache.org/jira/browse/FLINK-16511
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>            Reporter: Maximilian Michels
>            Assignee: Maximilian Michels
>            Priority: Major
>
> Under high memory pressure, the task manager shutdown on fatal errors is not 
> reliable:
> If a task does not cooperate and cannot be canceled and there is an OOM when 
> starting the task cancellation watchdog thread, the exception is not 
> propagated correctly. The reason for this is that the job manager retries the 
> cancelTask() request multiple times. The operation is stateful and if we fail 
> to start the watchdog thread, we won't attempt it again as the task already 
> switches to the CANCELING state before starting the watchdog thread.
> Such fatal errors should automatically shut down the task manager without a 
> retry from the job manager side.
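
The failure mode described above can be illustrated with a minimal sketch. This is not Flink's actual Task code: the class CancelSketch, the WatchdogStarter interface, the fatalErrorHandler callback, and both method names are hypothetical and exist only for this illustration. It assumes the state is switched to CANCELING before the watchdog is started, as the description states, and shows one possible way to escalate a failed watchdog start instead of relying on a retry.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Hypothetical, simplified model of the cancellation path (not Flink's Task class). */
class CancelSketch {
    enum State { RUNNING, CANCELING }

    /** Stand-in for starting the watchdog thread; may throw OutOfMemoryError. */
    interface WatchdogStarter { void start(); }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private final WatchdogStarter watchdogStarter;
    private final Consumer<Throwable> fatalErrorHandler; // assumed hook, e.g. "shut down the process"

    CancelSketch(WatchdogStarter watchdogStarter, Consumer<Throwable> fatalErrorHandler) {
        this.watchdogStarter = watchdogStarter;
        this.fatalErrorHandler = fatalErrorHandler;
    }

    State state() { return state.get(); }

    /**
     * Behaviour as described in the report. Returns true if this call
     * attempted to start the watchdog, false if it was a no-op.
     */
    boolean cancelAsReported() {
        if (!state.compareAndSet(State.RUNNING, State.CANCELING)) {
            // A retried cancel request ends up here: the state is already
            // CANCELING, so the watchdog start is never attempted again.
            return false;
        }
        // If this throws (e.g. OOM while creating the thread), the error only
        // reaches the caller of this one request; the state stays CANCELING.
        watchdogStarter.start();
        return true;
    }

    /** One possible remedy: treat a failed watchdog start as fatal right away. */
    boolean cancelWithFatalErrorHandling() {
        if (!state.compareAndSet(State.RUNNING, State.CANCELING)) {
            return false;
        }
        try {
            watchdogStarter.start();
        } catch (Throwable t) {
            // Escalate locally instead of depending on a retry from the caller.
            fatalErrorHandler.accept(t);
        }
        return true;
    }
}
```

In the first variant, a second cancel call after a failed watchdog start returns without doing anything, which matches the stateful behaviour described in the issue. In the second variant, the error reaches the fatal error handler on the first attempt.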



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
