I agree with using Akka for RPC. It is ASF 2.0 licensed, seems to have a big community [1] and users [2] that depend on the system.
The YARN client is also using the old RPC service. I would like to rewrite it with Akka once we have added it into the other parts of the system, to learn it. [1] https://github.com/akka/akka/pulls [2] http://doc.akka.io/docs/akka/2.0.4/additional/companies-using-akka.html On Fri, Sep 5, 2014 at 1:34 PM, Stephan Ewen <[email protected]> wrote: > This proposes to refactor the RPC service and the coordination between > Client, JobManager, and TaskManager to use the Akka actor library. > > Even though Akka is written in Scala, it offers a Java interface and we can > use Akka completely from Java. > > Below are a list of arguments why this would help the system: > > > Problems with the current RPC service: > -------------------------------------------------------- > > - No asynchronous calls with callbacks. This is the reason why several > parts of the runtime poll the status, introducing unnecessary latency. > > - No exception forwarding (many exceptions are simply swallowed), making > debugging and operation in flaky environments very hard > > - Limited number of handler threads. The RPC can only handle a fix number > of concurrent requests, forcing you to maintain separate thread pools to > delegate actions to > > - No support for primitive data types (or boxed primitives) as arguments, > everything has to be a specially serializable type > > - Problematic threading model. The RPC continuously spawns and terminates > threads > > > > Benefits of switching to the Akka actor model: > > ------------------------------------------------------------------------------- > > - Akka solves all of the above issues out of the box > > - The supervisor model allows you to do failure detection of actors. That > provides a unified way of detecting and handling failures (missing > heartbeats, failed calls, ...) > > - Akka has tools to make stateful actors persistent and restart them on > other machines in cases of failure. That would greatly help in implementing > "master fail-over", which will become important > > - You can define many "call targets" (actors). Tasks (on taskmanagers) > can directly call their ExecutionVertex on the JobManager, rather than > calling the JobManager, creating a Runnable that looks up the execution > vertex, and so on... > > - The actor model's approach to queue actions on an actor and run the one > after another makes the concurrency model of the state machine very simple > and robust > > - We "outsource" our own concerns about maintaining and improving that > part of the system > > Greetings, > Stephan >
