This proposes to refactor the RPC service and the coordination between Client, JobManager, and TaskManager to use the Akka actor library.
Even though Akka is written in Scala, it offers a Java interface and we can use Akka completely from Java. Below are a list of arguments why this would help the system: Problems with the current RPC service: -------------------------------------------------------- - No asynchronous calls with callbacks. This is the reason why several parts of the runtime poll the status, introducing unnecessary latency. - No exception forwarding (many exceptions are simply swallowed), making debugging and operation in flaky environments very hard - Limited number of handler threads. The RPC can only handle a fix number of concurrent requests, forcing you to maintain separate thread pools to delegate actions to - No support for primitive data types (or boxed primitives) as arguments, everything has to be a specially serializable type - Problematic threading model. The RPC continuously spawns and terminates threads Benefits of switching to the Akka actor model: ------------------------------------------------------------------------------- - Akka solves all of the above issues out of the box - The supervisor model allows you to do failure detection of actors. That provides a unified way of detecting and handling failures (missing heartbeats, failed calls, ...) - Akka has tools to make stateful actors persistent and restart them on other machines in cases of failure. That would greatly help in implementing "master fail-over", which will become important - You can define many "call targets" (actors). Tasks (on taskmanagers) can directly call their ExecutionVertex on the JobManager, rather than calling the JobManager, creating a Runnable that looks up the execution vertex, and so on... - The actor model's approach to queue actions on an actor and run the one after another makes the concurrency model of the state machine very simple and robust - We "outsource" our own concerns about maintaining and improving that part of the system Greetings, Stephan
