Sounds good!

On Mon, Jan 5, 2015 at 7:14 PM, Till Rohrmann <trohrm...@apache.org> wrote:
> I addressed the issues.
>
> On Mon, Jan 5, 2015 at 3:16 PM, Stephan Ewen <se...@apache.org> wrote:
>
> > Hi!
> >
> > Thanks for clarifying. Here are some thoughts:
> >
> > 1) The akka URL should go through a non-exposed mechanism, true. In fact,
> > using the global configuration for the local embedded mode at all seems to
> > be a bad design that we should get rid of.
> >
>
> Sooner or later we should also get rid of the global configuration.
>
> > 2) Okay, so we keep our own heartbeats in place as a means for metric
> > reports. At some point, we can avoid having the JobManager actor watch the
> > TaskManager actor then, it seems.
> >
>
> That is right. We can choose depending on which one performs better.
>
> > 3) re: transport failure detector - makes sense
> >
> > 4) Yes, let's have a single timeout value that defines the ask timeout, tcp
> > timeout, and the interval of the watch failure detector, and allow
> > overriding them by specifying the options.
> >
>
> I chose the following heuristics for the moment:
>
> ask timeout = tcp timeout = startup timeout = death watch pause
>             = 10 * interval of death watch
>
> We can see how they behave and adapt them if necessary.
>
> > Stephan
> >
> >
> > On Mon, Jan 5, 2015 at 10:30 AM, Till Rohrmann <trohrm...@apache.org>
> > wrote:
> >
> > > Hi,
> > >
> > > you are right, the new implementation still lacks a lot of documentation,
> > > which makes understanding the code harder than necessary.
> > >
> > > On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <se...@apache.org> wrote:
> > >
> > > > Hi!
> > > >
> > > > Since the new distributed infrastructure is built on Akka, some internal
> > > > concepts have changed now.
> > > > I think that this is currently not really documented anywhere.
> > > >
> > > > @Till Can you elaborate on the questions here:
> > > >
> > > > - What is the Akka URL in the global configuration ("jobmanager.akka.url")?
> > > > From the perspective of the global configuration, don't we simply have the
> > > > address and port of the actor system?
> > > >
> > >
> > > The jobmanager.akka.url is used to overwrite the default akka url
> > > generation, which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary in
> > > cases where we do not have remote actor systems but a single local one, as
> > > in the case of local execution, and thus have to use a different url scheme.
> > > In case of a single actor system, the url would be
> > > akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only
> > > used internally and should not be configured by the user. To make it
> > > fail-safe we should probably use a non-exposed mechanism.
> > >
> > > > - We currently have multiple competing failure-detection mechanisms: For
> > > > one, the job manager actor watches the task manager actors. Also, we still
> > > > have the manual heartbeats in place. Shouldn't we remove the old manual
> > > > heartbeats and have the instance manager watch the task manager actors?
> > > >
> > >
> > > It's right that we still have the old heartbeats in place, but they are
> > > stripped down. Currently, they are only used to update the
> > > lastReceivedHeartBeat field in the Instance object. Consequently, they
> > > could simply be removed, at the price of no longer showing the time since
> > > the last heartbeat in the web interface. The failure detection mechanism is
> > > currently realized exclusively by using Akka's death watch, meaning that
> > > the JobManager watches the TaskManagers and vice versa. I also thought that
> > > some people wanted to piggyback on the heartbeat message to do monitoring.
> > > Therefore I kept it for the moment. But I guess that a dedicated monitoring
> > > message would be better.
> > >
> > > > - There are transport heartbeats and watch heartbeats. I could not find a
> > > > good explanation of what the transport heartbeats are. Also, the heartbeat
> > > > interval is very large (1000 s) by default, so I am wondering what their
> > > > purpose is.
> > > >
> > >
> > > Yes, you're right that Akka has a lot of little knobs to turn and twist, and
> > > some of them are more obvious than others. The transport failure detector
> > > is Akka's own mechanism to detect lost messages. This is necessary for UDP
> > > but not for TCP since it has its own failure detector. In order to decrease
> > > the unnecessary network traffic, I set the heartbeat pause and heartbeat
> > > interval of the transport failure detector to these high numbers.
> > >
> > > > - There are many different timeouts:
> > > > -> startup timeout
> > > >
> > >
> > > That is the timeout for creating an actor system.
> > >
> > > > -> watch heartbeat timeout
> > > >
> > >
> > > This timeout is used for the death watch. But the detector is actually
> > > controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause and
> > > akka.watch.threshold. In [1] it is described what these parameters do.
> > >
> > > > -> ask timeout
> > > >
> > >
> > > That is the general timeout which is used for all futures once the actor
> > > system has been started.
> > >
> > > > -> TCP timeout
> > > >
> > >
> > > The TCP timeout is the timeout which is used by Netty for all outbound
> > > connections.
> > >
> > > > How do they relate / interact? Does it make sense to define them relative
> > > > to one another?
> > > >
> > >
> > > For the sake of simplicity and usability, it is a good idea to derive the
> > > individual timeouts by means of some heuristics from a single timeout
> > > value.
> > > Maybe we could use these heuristics as default values but still
> > > allow users to define these values themselves if they want to.
> > >
> > > > I think it makes a lot of sense to document these points somewhere.
> > > >
> > >
> > > I'll add an overview and details of the implementation to the internals
> > > section of the documentation.
> > >
> > > > Greetings,
> > > > Stephan
> > > >
> > >
> > > [1]
> > > http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors
> > >
> >
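
For reference, the heuristic quoted in this thread (ask timeout = tcp timeout = startup timeout = death watch pause = 10 * interval of death watch, with each value still overridable) could be sketched roughly as below. This is only an illustration in Python, not the actual Flink code: the names `AkkaTimeouts` and `derive_timeouts`, the use of seconds as the unit, and the rule that an unset death watch interval is derived from the base timeout (rather than from an overridden pause) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AkkaTimeouts:
    """Hypothetical container for the derived timeouts, all in seconds."""
    ask: float
    tcp: float
    startup: float
    death_watch_pause: float
    death_watch_interval: float


def derive_timeouts(base_timeout: float,
                    ask: Optional[float] = None,
                    tcp: Optional[float] = None,
                    startup: Optional[float] = None,
                    death_watch_pause: Optional[float] = None,
                    death_watch_interval: Optional[float] = None) -> AkkaTimeouts:
    """Derive all timeouts from one base value; explicit arguments override."""
    if base_timeout <= 0:
        raise ValueError("base_timeout must be positive")

    def pick(override: Optional[float], default: float) -> float:
        # An explicitly configured value always wins over the heuristic.
        return default if override is None else override

    return AkkaTimeouts(
        ask=pick(ask, base_timeout),
        tcp=pick(tcp, base_timeout),
        startup=pick(startup, base_timeout),
        death_watch_pause=pick(death_watch_pause, base_timeout),
        # pause = 10 * interval, so the default interval is a tenth of the base
        # (assumption: it is not re-derived from an overridden pause).
        death_watch_interval=pick(death_watch_interval, base_timeout / 10),
    )


if __name__ == "__main__":
    print(derive_timeouts(100.0))
    print(derive_timeouts(100.0, tcp=20.0))
```

With a single base value every timeout follows from it, and a user who sets, say, only the TCP timeout changes that one value without affecting the others.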