We were investigating slave re-registration behavior on master failover in Aurora 0.17 with Mesos 1.1.

A few important points:

http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
(If an agent does not reregister with the new master within a timeout (controlled by the --agent_reregister_timeout configuration flag), the master marks the agent as failed and follows the same steps described above. However, there is one difference: by default, agents are allowed to reconnect following master failover, even after the agent_reregister_timeout has fired. This means that frameworks might see a TASK_LOST update for a task but then later discover that the task is running (because the agent where it was running was allowed to reconnect).)

http://mesos.apache.org/documentation/latest/reconciliation/
(Implicit reconciliation (passing an empty list) should also be used periodically, as a defense against data loss in the framework. Unless a strict registry is in use on the master, it's possible for tasks to resurrect from a LOST state (without a strict registry the master does not enforce agent removal across failovers). When an unknown task is encountered, the scheduler should kill or recover the task.)

https://issues.apache.org/jira/browse/MESOS-5951
(Removes the strict registry mode flag from 1.1 and reverts to the old behavior of non-strict registry mode, where tasks and executors were not killed on agent reregistration timeout on master failover.)

So, what we found:

If the slave does not come back after 10 mins:
1. Mesos master sends slave lost, but not task lost, to Aurora.
2. Aurora does not replace the tasks.
3. Only when explicit recon starts does this get corrected, with Aurora spawning replacement tasks.

If the slave restarts after 10 mins:
1. When implicit recon starts, this situation gets fixed: Aurora has the tasks marked as lost, Mesos reports them as running, and those tasks get killed and replaced.

So, questions:
1. When Mesos sends slave lost after 10 mins in this situation, why does Aurora not act on it?
2. As per the recon docs' best practices, explicit recon should start, followed by implicit recon, on master failover. It looks like Aurora is not doing that, and the regular hourly recons are running with a 30 min spread between explicit and implicit. Should Aurora do recon on master failover?
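
To make the timing concern concrete, here is a rough sketch of how long a lost slave could go unnoticed under a periodic recon schedule. It assumes explicit recon first fires after an initial delay and then at a fixed interval, with implicit recon offset from it by the spread. The function names and the 1 min initial delay are illustrative assumptions, not Aurora's actual scheduler code.

```python
def reconciliation_times(interval_min, spread_min, initial_delay_min, horizon_min):
    """Return (explicit, implicit) run times in minutes since scheduler start.

    Assumption: explicit runs start at initial_delay_min and repeat every
    interval_min; each implicit run follows an explicit run by spread_min.
    """
    explicit = list(range(initial_delay_min, horizon_min + 1, interval_min))
    implicit = [t + spread_min for t in explicit if t + spread_min <= horizon_min]
    return explicit, implicit


def wait_for_next(times, event_min):
    """Minutes from event_min until the next run at or after it, or None."""
    for t in times:
        if t >= event_min:
            return t - event_min
    return None


if __name__ == "__main__":
    # Hypothetical scenario: the slave is declared lost 15 minutes after
    # the scheduler started (failover plus the 10 min reregister timeout).
    slave_lost_at = 15

    for interval, spread in [(60, 30), (15, 2)]:
        explicit, implicit = reconciliation_times(interval, spread, 1, 180)
        print(
            f"interval={interval} spread={spread}: "
            f"explicit in {wait_for_next(explicit, slave_lost_at)} min, "
            f"implicit in {wait_for_next(implicit, slave_lost_at)} min"
        )
```

Under these assumptions, with the hourly interval the tasks would wait 46 minutes for the next explicit recon, while a 15 min interval would bring that down to 1 minute in this scenario.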

General questions:
1. What is the effect on Aurora if we run explicit recon every 15 mins instead of the default 1 hr? Does it slow down scheduling, does snapshot creation get delayed, etc.?
2. Is there any issue if the spread between explicit recon and implicit recon is brought down to 2 mins from 30 mins? This probably depends on 1.

Thx