Hi!

Since the new distributed infrastructure is built on Akka, some internal
concepts have changed.
I think this is currently not really documented anywhere.

@Till Can you elaborate on the questions here:

 - What is the Akka URL in the global configuration
("jobmanager.akka.url")?
From the perspective of the global configuration, don't we simply have the
address and port of the actor system?

 - We currently have multiple competing failure-detection mechanisms: For
one, the job manager actor watches the task manager actors. In addition, we
still have the manual heartbeats in place. Shouldn't we remove the old manual
heartbeats and have the instance manager watch the task manager actors?

 - There are transport heartbeats and watch heartbeats. I could not find a
good explanation of what the transport heartbeats are. Also, the heartbeat
interval is very large (1000 s) by default, so I am wondering what their
purpose is.

 - There are many different timeouts:
   -> startup timeout
   -> watch heartbeat timeout
   -> ask timeout
   -> TCP timeout
  How do they relate / interact? Does it make sense to define them relative
to one another?
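
To make the first question more concrete, here is a minimal sketch of what I
would expect: the URL being derived from just the address and port. It relies
only on the general shape of Akka remote actor paths
(akka.tcp://<system>@<host>:<port>/user/<actor>). The system name "flink", the
actor name "jobmanager", and the port 6123 used below are assumptions for
illustration, not values I have checked against the current code.

```java
public final class AkkaUrlSketch {
    // Assumed names, for illustration only; the real actor system and
    // actor names may differ.
    static final String SYSTEM_NAME = "flink";
    static final String ACTOR_NAME = "jobmanager";

    // Builds a remote actor path from host and port, following the
    // akka.tcp://<system>@<host>:<port>/user/<actor> layout.
    static String jobManagerUrl(String host, int port) {
        return "akka.tcp://" + SYSTEM_NAME + "@" + host + ":" + port
                + "/user/" + ACTOR_NAME;
    }

    public static void main(String[] args) {
        // Example host and port, chosen arbitrarily.
        System.out.println(jobManagerUrl("localhost", 6123));
    }
}
```

If something like this holds, the configuration would only need the address
and port, and the full URL could stay an internal detail.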

I think it makes a lot of sense to document these points somewhere.

Greetings,
Stephan
