[
https://issues.apache.org/jira/browse/FLINK-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874304#comment-15874304
]
ASF GitHub Bot commented on FLINK-5830:
---------------------------------------
GitHub user zhijiangW opened a pull request:
https://github.com/apache/flink/pull/3360
[FLINK-5830][Distributed Coordination] Handle OutOfMemory error during
process async message in akka rpc actor
If caught OOM error during process async messages in **AkkaRpcActor**, it
will bring ambiguous behavior and may lost rpc messages. If the message is for
notifying final state in **TaskExecutor**, it will result in **JobMaster** can
not receive final state any more during process failing job, and may cause job
stuck in final.
The solution is to catch this special error in **AkkaRpcActor** and throw
it, then it will result in shutting down **ActorSystem** and exiting
**TaskExecutor** process. So the **JobMaster** can be aware of that and make
the job restart if necessary.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zhijiangW/flink FLINK-5830
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/3360.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3360
----
commit 1365c6da1c456d764a3171c858bce81511ed8da5
Author: 淘江 <[email protected]>
Date: 2017-02-20T09:54:54Z
[FLINK-5830][Distributed Coordination]Handle OutOfMemory error during
process async message in akka rpc actor
----
> OutOfMemoryError during notify final state in TaskExecutor may cause job stuck
> ------------------------------------------------------------------------------
>
> Key: FLINK-5830
> URL: https://issues.apache.org/jira/browse/FLINK-5830
> Project: Flink
> Issue Type: Bug
> Reporter: zhijiang
>
> The scenario is like this:
> {{JobMaster}} tries to cancel all the executions when process failed
> execution, and the task executor already acknowledge the cancel rpc message.
> When notify the final state in {{TaskExecutor}}, it causes OOM in
> {{AkkaRpcActor}} and this error is caught to log the info. The final state
> will not be sent any more.
> The {{JobMaster}} can not receive the final state and trigger the restart
> strategy.
> One solution is to catch the {{OutOfMemoryError}} and throw it, then it will
> cause to shut down the {{ActorSystem}} resulting in exiting the
> {{TaskExecutor}}. The {{JobMaster}} can be notified of {{TaskExecutor}}
> failure and fail all the tasks to trigger restart successfully.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)