Hi all,

We're using Mesos 1.1.0 and have observed some unexpected behavior with regard to agent reregistration on our cluster.
When an agent fails its health check (it stops responding for *max_agent_ping_timeouts* consecutive pings), our framework (in this case Apache Aurora) receives an Agent Lost message along with a TASK_LOST message for each task that was running on that agent.

We expected the same behavior when an agent does not reregister before the *agent_reregister_timeout* is up. However, while our framework did receive an Agent Lost message once 10 minutes (the default *agent_reregister_timeout* value) had passed since leader election, it did not receive any messages concerning the tasks that were running on that node.

This can lead to two problems. If the agent goes away permanently, we have tasks that are unaccounted for and won't be restarted on another agent until an explicit reconciliation is done. On the other hand, if the agent does come back after the reregister timeout and the framework has already replaced the missing instances, the instances that were previously running will continue to run until an implicit reconciliation is done.

I understand some behavior may have changed with partition-aware frameworks, so I'm trying to understand whether this is the expected behavior. For what it's worth, Aurora is not a partition-aware framework.

Any help would be appreciated.

Thanks!
-Renan
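
P.S. To make the two scenarios concrete, here is a rough toy model of how I understand explicit and implicit reconciliation to differ. It is only a sketch: the function names, task IDs, and dictionaries are made up for illustration and are not Mesos or Aurora APIs, and it assumes a framework that is not partition-aware (so a task unknown to the master is reported as TASK_LOST).

```python
# Toy model only: these helpers are hypothetical, not Mesos or Aurora APIs.

def explicit_reconcile(framework_tasks, master_tasks):
    """The framework lists the tasks it knows; the master answers for each.

    A task the master no longer knows is reported as TASK_LOST
    (the answer a non-partition-aware framework would see).
    """
    return {task_id: master_tasks.get(task_id, "TASK_LOST")
            for task_id in framework_tasks}


def implicit_reconcile(master_tasks):
    """The framework sends an empty list; the master reports every task it knows."""
    return dict(master_tasks)


# Scenario 1: the agent is gone for good, so the master has no record of the task.
gone_framework = {"job-0": "TASK_RUNNING"}
gone_master = {}
scenario_1 = explicit_reconcile(gone_framework, gone_master)

# Scenario 2: the agent came back after the framework launched a replacement.
back_framework = {"job-0-replacement": "TASK_RUNNING"}
back_master = {"job-0": "TASK_RUNNING", "job-0-replacement": "TASK_RUNNING"}
scenario_2_explicit = explicit_reconcile(back_framework, back_master)
scenario_2_implicit = implicit_reconcile(back_master)

# Tasks only the master knows about are the orphans the framework must kill.
orphans = sorted(set(scenario_2_implicit) - set(back_framework))
print(scenario_1, orphans)
```

In this model, the explicit pass is what surfaces the lost task in scenario 1, while in scenario 2 only the implicit pass reveals the old instance, since the framework no longer asks about it.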