[
https://issues.apache.org/jira/browse/FLINK-12889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876799#comment-16876799
]
Till Rohrmann commented on FLINK-12889:
---------------------------------------
Thanks for reporting this issue [~xinpu] and the analysis [~zjwang]. I think
your analysis is correct. The problem is that we don't catch the last OOM when
creating the task canceller thread. This should indeed kill the TM process
because it is no longer guaranteed that this Flink process works correctly.
I would suggest creating the {{StreamTask#asyncOperationsThreadPool}} with a
{{FatalExitExceptionHandler}} as its uncaught exception handler. This should
cause the TM to exit in the situation you've described [~xinpu].
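
To illustrate the suggestion, here is a minimal JDK-only sketch, not the actual
Flink change. {{ExitOnUncaught}}, {{factoryWith}}, {{simulateFatalError}} and
the exit code are hypothetical names and values made up for this example;
{{ExitOnUncaught}} only stands in for {{FatalExitExceptionHandler}}. The idea
is that every thread the pool creates gets the handler, so an error escaping a
task triggers the exit action instead of silently killing the worker thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

final class FatalHandlerSketch {

    // Arbitrary non-zero exit code chosen for this sketch (assumption).
    static final int EXIT_CODE = -17;

    /** Hypothetical stand-in for a fatal-exit uncaught exception handler. */
    static final class ExitOnUncaught implements Thread.UncaughtExceptionHandler {
        private final IntConsumer exitAction;

        ExitOnUncaught(IntConsumer exitAction) {
            this.exitAction = exitAction;
        }

        @Override
        public void uncaughtException(Thread t, Throwable e) {
            try {
                System.err.println("FATAL: thread '" + t.getName() + "' died with " + e);
            } finally {
                // Run the exit action even if logging itself fails.
                exitAction.accept(EXIT_CODE);
            }
        }
    }

    /** Builds a factory whose threads all carry the given handler. */
    static ThreadFactory factoryWith(String prefix, Thread.UncaughtExceptionHandler handler) {
        AtomicInteger counter = new AtomicInteger();
        return runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + counter.incrementAndGet());
            t.setDaemon(true);
            t.setUncaughtExceptionHandler(handler);
            return t;
        };
    }

    /**
     * Throws a simulated OOM inside a pool thread and returns the exit code the
     * handler asked for. The exit action only records the code here; a real
     * process would pass something like Runtime.getRuntime()::halt instead.
     */
    static int simulateFatalError() {
        AtomicInteger observedExitCode = new AtomicInteger(0);
        CountDownLatch handlerRan = new CountDownLatch(1);
        IntConsumer recordingExit = code -> {
            observedExitCode.set(code);
            handlerRan.countDown();
        };

        ExecutorService pool = Executors.newCachedThreadPool(
                factoryWith("async-operations", new ExitOnUncaught(recordingExit)));
        try {
            // execute(), not submit(): submit() would capture the error in a Future.
            pool.execute(() -> {
                throw new OutOfMemoryError("simulated");
            });
            handlerRan.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return observedExitCode.get();
    }

    public static void main(String[] args) {
        System.out.println("handler requested exit code " + simulateFatalError());
    }
}
```

One caveat with this approach: the handler only fires for tasks handed over via
{{execute()}}. Tasks passed to {{submit()}} have their throwable stored in the
returned {{Future}}, so it never reaches the uncaught exception handler.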
> Job keeps in FAILING state
> --------------------------
>
> Key: FLINK-12889
> URL: https://issues.apache.org/jira/browse/FLINK-12889
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.7.2, 1.8.1, 1.9.0
> Reporter: Fan Xinpu
> Priority: Critical
> Fix For: 1.7.3, 1.8.2, 1.9.0
>
> Attachments: 20190618104945417.jpg, jobmanager.log.2019-06-16.0,
> taskmanager.log
>
>
> The topology has 3 operators: source, parser, and persist.
> Occasionally, 5 subtasks of the source encounter an exception and turn to
> FAILED; at the same time, one subtask of the parser runs into an exception
> and turns to FAILED too. The jobmaster gets a message about the parser's
> failure. The jobmaster then tries to cancel all the subtasks. Most of the
> subtasks of the three operators turn to CANCELED, except the 5 subtasks of
> the source, because their state is already FAILED before the jobmaster tries
> to cancel them. The jobmaster then cannot reach a final state but stays in
> the FAILING state, while the subtask of the source stays in CANCELING state.
>
> The job runs on a Flink 1.7 cluster on YARN, and there is only one TM with 10
> slots.
>
> The attached files contain a JM log, a TM log, and the UI screenshot.
>
> The exception timestamp is about 2019-06-16 13:42:28.