Hi, you are right, the new implementation still lacks a lot of documentation which makes understanding the code harder than necessary.
On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <se...@apache.org> wrote: > Hi! > > Since the new distributed infrastructure is built on Akka, some internal > concepts have changed now. > I think that this is currently not really document anywhere > > @Till Can you elaborate on the questions here: > > - What is the Akka URL in the global configuration ("jobmanager.akka.url") > From the perspective of the global configuration, don't we simply have the > address and port of the actor system? > The jobmanager.akka.url is used to overwrite the default akka url generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary in cases where we do not have remote actor systems but a single local, as in the case of local execution, and thus have to use a different url scheme. In case of a single actor system, the url would be akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only used internally and should not be configured by the user. To make it fail-safe we should probably use a non exposed mechanism. > > - We currently have multiple competing failure-detection mechanisms: For > one, the job manager actor watches the task manager actors. Also, we still > have the manual heart beats in place. Shouldn't we remove the old manual > heartbeats and have the instance manager watch the task manager actors? > It's right that we still have the old heartbeats in place but they are stripped down. Currently, they are only used to update the lastReceivedHeartBeat field in the Instance object. Consequently, they could be simply removed at the price of not getting shown the time since the last heartbeat in the web interface. The failure detection mechanism is currently realized exclusively by using Akka's death watch, meaning that the JobManager watches the TaskManagers and vice versa. I also thought that some people wanted to piggy back on the heartbeat message to do monitoring. Therefore I kept it for the moment. But I guess that a dedicated monitoring message would be better. > - There are transport heartbeats and watch heartbeats. I could not find a > good explanation of what the transport heartbeats are. Also, the heartbeat > interval is very large (1000 s) by default, so I am wondering what there > purpose is. > Yes you're right that Akka has a lot of little knobs to turn and twist and some of them are more obvious than others. The transport failure detector is Akka's own mechanism to detect lost messages. This is necessary for UDP but not for TCP since it has its own failure detector. In order to decrease the unnecessary network traffic, I set the heartbeat pause and heartbeat interval of the transport failure detector to these high numbers. > > - There are many different timeouts: > -> startup timeout > That is the timeout for creating an actor system. > -> watch heartbeat timeout > This timeout is used for the death watch. But the detector is actually controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause and akka.watch.threshold. In [1] it is described what these parameters do. > -> ask timeout > That is the general timeout which is used for all futures once the actor system has been started. > -> TCP timeout > The TCP timeout is the timeout which is used by Netty for all outbound connections. > How to the relate / interact? Does it make sense to define them relative > to one another? For the sake of simplicity and usability, it is a good idea to derive the individual timeouts by means of some heuristics from a single timeout value. Maybe we could use these heuristics as default values but still allow the user to define these values himself if he wants to. > > I think it makes a lot of sense to document these points somewhere. > I'll add an overview and details of the implementation to the internals section of the documentation. > > Greetings, > Stephan > [1] http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors