GitHub user zhijiangW opened a pull request:
https://github.com/apache/flink/pull/3360
[FLINK-5830][Distributed Coordination] Handle OutOfMemory error during
process async message in akka rpc actor
If an OutOfMemoryError is caught while **AkkaRpcActor** processes an async
message, the resulting behavior is ambiguous and RPC messages may be lost. If
the lost message is the one by which **TaskExecutor** reports a final state,
the **JobMaster** never receives that final state while handling the failing
job, and the job may get stuck as a result.
The solution is to catch this specific error in **AkkaRpcActor** and rethrow
it, which shuts down the **ActorSystem** and makes the **TaskExecutor**
process exit. The **JobMaster** can then notice the failure and restart the
job if necessary.
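
The idea can be sketched in plain Java as follows. This is only an
illustration under assumptions, not the code in this pull request: the class
name AsyncMessageSketch, the method handleRunAsync and the failure counter are
hypothetical stand-ins, and no Akka types are involved. Ordinary failures are
recorded and processing continues, while an OutOfMemoryError is rethrown so
that the surrounding system can terminate the process.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an actor's async-message path; not Flink's AkkaRpcActor. */
public class AsyncMessageSketch {

    /** Counts non-fatal failures that were swallowed (a real actor would log them). */
    static final AtomicInteger swallowedFailures = new AtomicInteger();

    /**
     * Runs one async message. Ordinary failures are recorded and the caller keeps going;
     * an OutOfMemoryError is rethrown instead of being swallowed.
     */
    static void handleRunAsync(Runnable runnable) {
        try {
            runnable.run();
        } catch (OutOfMemoryError oom) {
            // Fatal: propagate rather than continue in an undefined state.
            throw oom;
        } catch (Throwable t) {
            // Non-fatal: record it and stay available for later messages.
            swallowedFailures.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        handleRunAsync(() -> { throw new IllegalStateException("recoverable"); });
        System.out.println("swallowed failures: " + swallowedFailures.get());

        try {
            handleRunAsync(() -> { throw new OutOfMemoryError("simulated"); });
        } catch (OutOfMemoryError oom) {
            System.out.println("propagated: " + oom.getMessage());
        }
    }
}
```

In the sketch the caller of handleRunAsync decides what to do with the
propagated error. In the setup described above, that role is played by the
**ActorSystem**, which shuts down.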
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zhijiangW/flink FLINK-5830
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/3360.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3360
----
commit 1365c6da1c456d764a3171c858bce81511ed8da5
Author: æ·æ± <[email protected]>
Date: 2017-02-20T09:54:54Z
[FLINK-5830][Distributed Coordination]Handle OutOfMemory error during
process async message in akka rpc actor
----