[jira] [Commented] (FLINK-5830) OutOfMemoryError during notify final state in TaskExecutor may cause job stuck

ASF GitHub Bot (JIRA) Mon, 20 Feb 2017 02:10:04 -0800

    [ 
https://issues.apache.org/jira/browse/FLINK-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874304#comment-15874304
 ]


ASF GitHub Bot commented on FLINK-5830:
---------------------------------------

GitHub user zhijiangW opened a pull request:

    https://github.com/apache/flink/pull/3360

    [FLINK-5830][Distributed Coordination] Handle OutOfMemory error during 
process async message in akka rpc actor

    If an OOM error is caught while processing async messages in
**AkkaRpcActor**, the resulting behavior is ambiguous and RPC messages may be
lost. If the lost message is the one notifying the final state in
**TaskExecutor**, the **JobMaster** never receives the final state while it is
processing the failing job, and the job may get stuck.
    
    The solution is to catch this specific error in **AkkaRpcActor** and
rethrow it, which shuts down the **ActorSystem** and makes the
**TaskExecutor** process exit. The **JobMaster** then becomes aware of the
failure and can restart the job if necessary.
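
    As a rough illustration of the catch-and-rethrow idea described above, here
is a minimal, self-contained sketch. It does not use Akka or Flink classes:
the class name `OomRethrowSketch` and the method `handleAsync` are hypothetical
stand-ins for the actor's async message handling, not the code in this pull
request.

```java
// Hypothetical sketch only: not the actual AkkaRpcActor implementation.
class OomRethrowSketch {

    /**
     * Runs an async message. Ordinary failures are swallowed and reported
     * (standing in for "caught and logged"), but OutOfMemoryError is rethrown
     * so the caller (in the real system, the actor system) can shut down.
     */
    static String handleAsync(Runnable message) {
        try {
            message.run();
            return "ok";
        } catch (OutOfMemoryError oom) {
            // Do not swallow: let the error escape so the process can exit
            // and the failure becomes visible to the coordinator.
            throw oom;
        } catch (Throwable t) {
            // Any other failure is only reported; processing continues.
            return "logged: " + t.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(handleAsync(() -> { }));
        System.out.println(handleAsync(() -> {
            throw new IllegalStateException("boom");
        }));
        try {
            handleAsync(() -> {
                throw new OutOfMemoryError("simulated");
            });
        } catch (OutOfMemoryError e) {
            System.out.println("propagated: " + e.getMessage());
        }
    }
}
```

    The `OutOfMemoryError` clause has to come before the generic `Throwable`
clause; otherwise the error would be swallowed like any other failure, which
is the behavior this issue describes.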

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhijiangW/flink FLINK-5830

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3360.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3360
    
----
commit 1365c6da1c456d764a3171c858bce81511ed8da5
Author: 淘江 <[email protected]>
Date:   2017-02-20T09:54:54Z

    [FLINK-5830][Distributed Coordination]Handle OutOfMemory error during 
process async message in akka rpc actor

----


> OutOfMemoryError during notify final state in TaskExecutor may cause job stuck
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-5830
>                 URL: https://issues.apache.org/jira/browse/FLINK-5830
>             Project: Flink
>          Issue Type: Bug
>            Reporter: zhijiang
>
> The scenario is as follows:
> {{JobMaster}} tries to cancel all executions while processing a failed
> execution, and the task executor has already acknowledged the cancel RPC
> message.
> When the {{TaskExecutor}} sends the final state notification, an OOM occurs
> in {{AkkaRpcActor}}; the error is caught and only logged. The final state is
> never sent.
> The {{JobMaster}} therefore never receives the final state and cannot
> trigger the restart strategy.
> One solution is to catch the {{OutOfMemoryError}} and rethrow it, which
> shuts down the {{ActorSystem}} and makes the {{TaskExecutor}} exit. The
> {{JobMaster}} is then notified of the {{TaskExecutor}} failure and fails all
> the tasks, so the restart is triggered successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
