Hi!

Thanks for clarifying. Here are some thoughts:

1) The Akka URL should go through a non-exposed mechanism, true. In fact,
using the global configuration for the local embedded mode at all seems to
be a bad design that we should get rid of.

2) Okay, so we keep our own heartbeats in place as a means for metric
reports. At some point, we can avoid having the JobManager actor watch the
TaskManager actor then, it seems.

3) re: transport failure detector - makes sense

4) Yes, let's have a single timeout value that defines the ask timeout, TCP
timeout, and the interval of the watch failure detector, and allow users to
override each of them by setting the individual options.
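
To make point 4 concrete, here is a minimal sketch of how the individual
timeouts could be derived from one base value, with explicit options taking
precedence. It is only an illustration: the class name DerivedTimeouts, the
keys "akka.ask.timeout" and "akka.tcp.timeout", and the 1/10 ratio for the
watch heartbeat interval are assumptions made for this example, not what the
code currently does. Only akka.watch.heartbeat.interval is taken from the
discussion below.

```java
import java.time.Duration;
import java.util.Map;

// Illustrative sketch only: key names (except akka.watch.heartbeat.interval)
// and the 1/10 heuristic are assumptions, not the existing implementation.
final class DerivedTimeouts {
    final Duration ask;
    final Duration tcp;
    final Duration watchHeartbeatInterval;

    private DerivedTimeouts(Duration ask, Duration tcp, Duration watchHeartbeatInterval) {
        this.ask = ask;
        this.tcp = tcp;
        this.watchHeartbeatInterval = watchHeartbeatInterval;
    }

    // Derives all timeouts from 'base'; any value present in 'overrides'
    // replaces the derived default for that option.
    static DerivedTimeouts from(Duration base, Map<String, Duration> overrides) {
        Duration ask = overrides.getOrDefault("akka.ask.timeout", base);
        Duration tcp = overrides.getOrDefault("akka.tcp.timeout", base);
        // Assumed heuristic: send watch heartbeats ten times per base timeout.
        Duration watch = overrides.getOrDefault(
                "akka.watch.heartbeat.interval", base.dividedBy(10));
        return new DerivedTimeouts(ask, tcp, watch);
    }
}
```

With a base value of 100 s and no overrides, this would give an ask and TCP
timeout of 100 s each and a watch heartbeat interval of 10 s. Setting only
akka.watch.heartbeat.interval would change that one value and leave the other
two derived from the base.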

Stephan


On Mon, Jan 5, 2015 at 10:30 AM, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi,
>
> you are right, the new implementation still lacks a lot of documentation
> which makes understanding the code harder than necessary.
>
> On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <se...@apache.org> wrote:
>
> > Hi!
> >
> > Since the new distributed infrastructure is built on Akka, some internal
> > concepts have changed now.
> > I think that this is currently not really documented anywhere.
> >
> > @Till Can you elaborate on the questions here:
> >
> >  - What is the Akka URL in the global configuration ("jobmanager.akka.url")?
> > From the perspective of the global configuration, don't we simply have the
> > address and port of the actor system?
> >
>
> The jobmanager.akka.url is used to overwrite the default akka url
> generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary in
> cases where we do not have remote actor systems but a single local one, as in
> the case of local execution, and thus have to use a different url scheme.
> In case of a single actor system, the url would be
> akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only
> used internally and should not be configured by the user. To make it
> fail-safe we should probably use a non-exposed mechanism.
>
>
> >
> >  - We currently have multiple competing failure-detection mechanisms: For
> > one, the job manager actor watches the task manager actors. Also, we still
> > have the manual heartbeats in place. Shouldn't we remove the old manual
> > heartbeats and have the instance manager watch the task manager actors?
> >
>
> It's right that we still have the old heartbeats in place but they are
> stripped down. Currently, they are only used to update the
> lastReceivedHeartBeat field in the Instance object. Consequently, they
> could simply be removed, at the price of no longer showing the time since
> the last heartbeat in the web interface. The failure detection mechanism is
> currently realized exclusively by using Akka's death watch, meaning that
> the JobManager watches the TaskManagers and vice versa. I also thought that
> some people wanted to piggyback on the heartbeat message to do monitoring.
> Therefore I kept it for the moment. But I guess that a dedicated monitoring
> message would be better.
>
>
> >  - There are transport heartbeats and watch heartbeats. I could not find a
> > good explanation of what the transport heartbeats are. Also, the heartbeat
> > interval is very large (1000 s) by default, so I am wondering what their
> > purpose is.
> >
>
> Yes you're right that Akka has a lot of little knobs to turn and twist and
> some of them are more obvious than others. The transport failure detector
> is Akka's own mechanism to detect lost messages. This is necessary for UDP
> but not for TCP, since TCP has its own failure detector. In order to decrease
> the unnecessary network traffic, I set the heartbeat pause and heartbeat
> interval of the transport failure detector to these high numbers.
>
>
> >
> >  - There are many different timeouts:
> >    -> startup timeout
> >
>
> That is the timeout for creating an actor system.
>
>
> >    -> watch heartbeat timeout
> >
>
> This timeout is used for the death watch. But the detector is actually
> controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause and
> akka.watch.threshold. [1] describes what these parameters do.
>
>
> >    -> ask timeout
> >
>
> That is the general timeout which is used for all futures once the actor
> system has been started.
>
>
> >    -> TCP timeout
> >
>
> The TCP timeout is the timeout which is used by Netty for all outbound
> connections.
>
>
> >   How do they relate / interact? Does it make sense to define them relative
> > to one another?
>
>
> For the sake of simplicity and usability, it is a good idea to derive the
> individual timeouts by means of some heuristics from a single timeout
> value. Maybe we could use these heuristics as default values but still
> allow the user to define these values explicitly if desired.
>
>
> >
> > I think it makes a lot of sense to document these points somewhere.
> >
>
> I'll add an overview and details of the implementation to the internals
> section of the documentation.
>
>
> >
> > Greetings,
> > Stephan
> >
>
> [1] http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors
>
