Hi! Since the new distributed infrastructure is built on Akka, some internal concepts have changed. I think this is currently not really documented anywhere.
@Till Can you elaborate on the questions here:

- What is the Akka URL in the global configuration ("jobmanager.akka.url")? From the perspective of the global configuration, don't we simply have the address and port of the actor system?

- We currently have multiple competing failure-detection mechanisms: For one, the job manager actor watches the task manager actors. Also, we still have the manual heartbeats in place. Shouldn't we remove the old manual heartbeats and have the instance manager watch the task manager actors?

- There are transport heartbeats and watch heartbeats. I could not find a good explanation of what the transport heartbeats are. Also, the heartbeat interval is very large (1000 s) by default, so I am wondering what their purpose is.

- There are many different timeouts:
  -> startup timeout
  -> watch heartbeat timeout
  -> ask timeout
  -> TCP timeout
  How do they relate / interact? Does it make sense to define them relative to one another?

I think it makes a lot of sense to document these points somewhere.

Greetings,
Stephan