This proposes to refactor the RPC service and the coordination between
Client, JobManager, and TaskManager to use the Akka actor library.

Even though Akka is written in Scala, it offers a Java interface and we can
use Akka completely from Java.

Below are a list of arguments why this would help the system:


Problems with the current RPC service:
--------------------------------------------------------

  - No asynchronous calls with callbacks. This is the reason why several
parts of the runtime poll the status, introducing unnecessary latency.

  - No exception forwarding (many exceptions are simply swallowed), making
debugging and operation in flaky environments very hard

  - Limited number of handler threads. The RPC can only handle a fix number
of concurrent requests, forcing you to maintain separate thread pools to
delegate actions to

  - No support for primitive data types (or boxed primitives) as arguments,
everything has to be a specially serializable type

  - Problematic threading model. The RPC continuously spawns and terminates
threads



Benefits of switching to the Akka actor model:
-------------------------------------------------------------------------------

  - Akka solves all of the above issues out of the box

  - The supervisor model allows you to do failure detection of actors. That
provides a unified way of detecting and handling failures (missing
heartbeats, failed calls, ...)

  - Akka has tools to make stateful actors persistent and restart them on
other machines in cases of failure. That would greatly help in implementing
"master fail-over", which will become important

  - You can define many "call targets" (actors). Tasks (on taskmanagers)
can directly call their ExecutionVertex on the JobManager, rather than
calling the JobManager, creating a Runnable that looks up the execution
vertex, and so on...

  - The actor model's approach to queue actions on an actor and run the one
after another makes the concurrency model of the state machine very simple
and robust

  - We "outsource" our own concerns about maintaining and improving that
part of the system

Greetings,
Stephan

Reply via email to