[jira] [Commented] (FLINK-16728) Taskmanager dies after job got stuck and canceling fails

Zhu Zhu (Jira) Wed, 25 Mar 2020 21:25:11 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067346#comment-17067346
 ]


Zhu Zhu commented on FLINK-16728:
---------------------------------

Hi [~lilyevsky], shutting down a TaskManager is intentional if task 
cancellation cannot finish within the timeout. This triggers a failure and 
forces the job to recover from it rather than stay stuck. So what actually 
matters is why the tasks are stuck, both in data processing and in task 
cancellation.

From the log you attached, I think the deeper stack frames of the blocked 
tasks are:

org.elasticsearch.action.bulk.BulkProcessor.flush(BulkProcessor.java:433)
org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkBase.snapshotState(ElasticsearchSinkBase.java:322)
app//org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)

So it looks like the task is blocked in the Elasticsearch flush and thus does 
not respond to the task cancellation request.
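
To illustrate the mechanism described above, here is a minimal Java sketch of 
a cancellation watchdog. It is not Flink's actual implementation: the class 
name CancellationWatchdogSketch and the method cancelWithTimeout are 
hypothetical, and the "stuck" thread only stands in for a blocking call (such 
as the flush in the stack above) that does not react to interruption. As far 
as I know, the real timeout in Flink is governed by the 
task.cancellation.timeout setting.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

/** Hypothetical sketch of a cancellation watchdog; not Flink's actual code. */
public class CancellationWatchdogSketch {

    /**
     * Interrupts the task thread and waits up to timeoutMillis (must be > 0)
     * for it to exit. Returns true if the task terminated in time, false if it
     * is still running, which is the point where a real system would escalate
     * (for example by shutting down the whole process).
     */
    static boolean cancelWithTimeout(Thread task, long timeoutMillis) {
        task.interrupt();
        try {
            task.join(timeoutMillis);
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status and fall through.
            Thread.currentThread().interrupt();
        }
        return !task.isAlive();
    }

    public static void main(String[] args) {
        // A task that reacts to interruption and exits.
        Thread responsive = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // Interrupted: leave the run method.
            }
        });

        // A task standing in for a blocking call that ignores interrupts.
        AtomicBoolean release = new AtomicBoolean(false);
        Thread stuck = new Thread(() -> {
            while (!release.get()) {
                LockSupport.parkNanos(1_000_000L);
            }
        });

        responsive.setDaemon(true);
        stuck.setDaemon(true);
        responsive.start();
        stuck.start();

        System.out.println("responsive cancelled in time: "
                + cancelWithTimeout(responsive, 2_000));
        System.out.println("stuck cancelled in time: "
                + cancelWithTimeout(stuck, 200));

        // Let the stuck thread finish so the demo ends cleanly.
        release.set(true);
    }
}
```

In this sketch the second call reports that cancellation did not complete in 
time, which corresponds to the case where the TaskManager is shut down on 
purpose.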

> Taskmanager dies after job got stuck and canceling fails
> --------------------------------------------------------
>
>                 Key: FLINK-16728
>                 URL: https://issues.apache.org/jira/browse/FLINK-16728
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Leonid Ilyevsky
>            Priority: Major
>         Attachments: taskmanager.log.20200323.gz
>
>
> At some point I noticed that a few jobs got stuck (they basically stopped 
> processing the messages, I could detect this watching the expected output), 
> so I tried to cancel them.
> The cancel operation failed, complaining that the job got stuck at 
> StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:86)
> and then the whole taskmanager shut down.
> See the attached log.
> This is actually happening practically every day in our staging environment 
> where we are testing Flink 1.10.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
