Hey Daniel, On Wed, Sep 3, 2014 at 11:48 PM, Daniel Warneke <[email protected]> wrote:
> quite frankly, I still don’t understand what concrete problems in Flink we > are trying to solve with introducing akka, or even worse, reimplementing > the JobManager and TaskManager in Scala. In my opinion, it is crucial to > clarify that before the vote starts. > I think we are all on the same page and Till started this thread to clarify the issues. It's unfortunate that we are having two interdependent discussions in one thread though. 1. Akka: The initial (orthogonal) issue, which initiated this thread is the question whether to replace our current RPC system with Akka ( https://issues.apache.org/jira/browse/FLINK-1019). I think that the points given by both Till and Stephan about Akka in this thread are valid technical reasons for a transition. They highlight both problems with the current RPC (threading, exception handling) and potential free lunches (heartbeat, supervision). > First, it is unclear to me why akka has such a strong standing in the > project that we are seriously contemplating if it is worth to introduce the > complexity of a second programming language to the very core of the system. > An RPC service is a total commodity component these days. Any other RPC > service could essentially do the job. Did somebody have a look at the > alternatives (kryo, Netty, …)? > I'm not sure if Till or Stephan considered alternatives, but given the above points Akka seems to be a good fit. Regarding Netty and Kryo: - We have replaced the custom TCP network code of the system with Netty some time ago and from my experience with Netty I don't think that it's a good fit for replacing the RPC service as it won't solve the low-level issues we are having right now. Instead it would just wrap them in a nice library and add complexity for message queing etc. - Kryo is imo just a serialization framework and KryoNet would be a competitor to Netty and not Akka. Second, I think it is also a misconception to think that the current RPC > service is a major source of scalability and latency issues. Most of the > scalability/latency problems we see arise from the currently rather > complex/clumsy way of traversing Flink’s internal scheduling structures > (i.e. the ExecutionGraph) upon status updates. The scheduling structure is > inherently shared state at the moment, so unless somebody wants to > reimplement it using actors and message passing, I don’t see how either > akka or Scala could help us here. > I think we have already replaced parts of the ExecutionGraph lookup structures to counter these problems. Stephan is currently working on reworking the scheduler (https://issues.apache.org/jira/browse/FLINK-1030) and if I'm not mistaken he has plans to make use of the Actor model for this. 2. Scala: In order to make use of some Akka provides, the JobManager and TaskManager need to be refactored to be Akka Actors. This refactoring can be done in Java or Scala. The original question of this thread is whether it would be OK to do it in Scala. Points both in favor and against this have been raised here and led to the corresponding [VOTE] thread. Since the [VOTE] is currently running I would suggest to move the discussion for this specific there.
