Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/3360

Thanks for the clarification @zhijiangW. I now understand the problem: via `RpcEndpoint.runAsync` we effectively introduce another message which might get "lost" (e.g. due to an OOM exception). I agree with Stephan that it's hard to reason about the consistency of the `AkkaRpcActor`'s internal state once we see an exception. The conservative approach would probably be to let it terminate or to call `notifyFatalError` to handle it.

Related to this is how we handle exceptions in `AkkaRpcActor.handleRpcInvocation`. There we catch all exceptions and simply send them to the caller. I think in this method we should only send the non-fatal exceptions back and terminate otherwise.

To follow a similar pattern for `handleRunAsync`, we could think about having `RpcEndpoint.runAsync` return a `Future` which is completed either with a non-fatal exception or once the `Runnable` has been executed. In case we see a fatal exception, we terminate or call `notifyFatalError`.

What do you think?
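
To make the `handleRunAsync` proposal more concrete, here is a rough sketch of the idea. It is not Flink code. The class `RunAsyncSketch`, the `isFatal` check, and the `fatalErrorHandler` callback (standing in for `notifyFatalError`) are hypothetical names, and which errors count as fatal is an assumption. It also uses `java.util.concurrent.CompletableFuture` for illustration and runs the `Runnable` synchronously, whereas the actor would run it in its main thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

/** Illustrative sketch only; all names here are hypothetical. */
final class RunAsyncSketch {

    /** Assumed classification: JVM-level errors (e.g. OutOfMemoryError) are fatal. */
    static boolean isFatal(Throwable t) {
        return t instanceof VirtualMachineError || t instanceof LinkageError;
    }

    /**
     * Runs the given Runnable and reports the outcome through the returned future.
     * Success completes the future normally, a non-fatal failure completes it
     * exceptionally, and a fatal error is handed to fatalErrorHandler instead.
     */
    static CompletableFuture<Void> handleRunAsync(
            Runnable runnable, Consumer<Throwable> fatalErrorHandler) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        try {
            runnable.run();
            result.complete(null);
        } catch (Throwable t) {
            if (isFatal(t)) {
                // Stand-in for terminating the actor or calling notifyFatalError.
                // The future is deliberately left incomplete in this sketch.
                fatalErrorHandler.accept(t);
            } else {
                // Non-fatal: the caller sees the failure via the future.
                result.completeExceptionally(t);
            }
        }
        return result;
    }
}
```

With something like this, a caller of `runAsync` could observe non-fatal failures via the returned future, while fatal ones would take the terminate / `notifyFatalError` path. Whether the future should also be failed in the fatal case is left open here.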