GitHub user zhijiangW opened a pull request:

    https://github.com/apache/flink/pull/3360

    [FLINK-5830][Distributed Coordination] Handle OutOfMemory error during process async message in akka rpc actor

    If an OOM error is caught while **AkkaRpcActor** processes async messages, the behavior becomes ambiguous and RPC messages may be lost. If a lost message notifies a final state from the **TaskExecutor**, the **JobMaster** never receives that final state while it is failing the job, which may leave the job stuck.
    
    The solution is to catch this specific error in **AkkaRpcActor** and rethrow it. This shuts down the **ActorSystem** and terminates the **TaskExecutor** process, so the **JobMaster** becomes aware of the failure and can restart the job if necessary.
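
    A minimal, hypothetical sketch of this handling in plain Java follows. It is not Flink's actual **AkkaRpcActor** code, and the names `AsyncMessageHandler` and `handleRunAsync` are illustrative assumptions. Ordinary exceptions from an async message are logged so the actor keeps processing messages, while an `OutOfMemoryError` is rethrown so it can escape the actor instead of being silently swallowed.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Hypothetical sketch (not Flink's AkkaRpcActor): runs the body of an
 * async RPC message and treats OutOfMemoryError as fatal.
 */
final class AsyncMessageHandler {

    private static final Logger LOG =
            Logger.getLogger(AsyncMessageHandler.class.getName());

    private AsyncMessageHandler() {}

    /**
     * Runs one async message body.
     *
     * @return true if the body completed normally, false if a non-fatal
     *         failure was logged and swallowed
     * @throws OutOfMemoryError rethrown unchanged so it propagates out of the actor
     */
    static boolean handleRunAsync(Runnable messageBody) {
        try {
            messageBody.run();
            return true;
        } catch (OutOfMemoryError oom) {
            // Fatal: swallowing it here could silently drop messages,
            // so rethrow and let the surrounding system shut down.
            throw oom;
        } catch (Throwable t) {
            // Non-fatal (in this sketch): log and keep processing later messages.
            LOG.log(Level.WARNING, "Non-fatal error while processing async message.", t);
            return false;
        }
    }
}
```

    Calling `handleRunAsync(() -> { throw new OutOfMemoryError("simulated"); })` propagates the error to the caller instead of returning `false`. Whether that then shuts down the **ActorSystem** depends on how the surrounding actor framework treats fatal errors escaping an actor.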

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhijiangW/flink FLINK-5830

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3360.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3360
    
----
commit 1365c6da1c456d764a3171c858bce81511ed8da5
Author: 淘江 <[email protected]>
Date:   2017-02-20T09:54:54Z

    [FLINK-5830][Distributed Coordination]Handle OutOfMemory error during process async message in akka rpc actor

----

