"1. When Mesos sends slave lost after 10 mins in this situation, why does Aurora not act on it?"
Because Mesos also sends TASK_LOST for every task running on the agent whenever it calls slaveLost:

When it is time to remove an agent, the master removes the agent from the list of registered agents in the master’s durable state <http://mesos.apache.org/documentation/latest/replicated-log-internals/> (this will survive master failover). The master sends a slaveLost callback to every registered scheduler driver; it also sends TASK_LOST status updates for every task that was running on the removed agent.

On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:

> We were investigating slave re-registration behavior on master failover
> in Aurora 0.17 with Mesos 1.1.
>
> A few important points:
>
> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
> (If an agent does not reregister with the new master within a timeout
> (controlled by the --agent_reregister_timeout configuration flag), the
> master marks the agent as failed and follows the same steps described
> above. However, there is one difference: by default, agents are allowed
> to reconnect following master failover, even after the
> agent_reregister_timeout has fired. This means that frameworks might see
> a TASK_LOST update for a task but then later discover that the task is
> running (because the agent where it was running was allowed to
> reconnect).)
>
> http://mesos.apache.org/documentation/latest/reconciliation/
> (Implicit reconciliation (passing an empty list) should also be used
> periodically, as a defense against data loss in the framework. Unless a
> strict registry is in use on the master, it's possible for tasks to
> resurrect from a LOST state (without a strict registry the master does
> not enforce agent removal across failovers). When an unknown task is
> encountered, the scheduler should kill or recover the task.)
>
> https://issues.apache.org/jira/browse/MESOS-5951
> (Removes the strict registry mode flag from 1.1 and reverts to the old
> behavior of non-strict registry mode, where tasks and executors were not
> killed on agent reregistration timeout on master failover)
>
> So, what we find, if the slave does not come back after 10 mins:
> 1. Mesos master sends slave lost but not task lost to Aurora.
> 2. Aurora does not replace the tasks.
> 3. Only when explicit recon starts does this get corrected, with Aurora
>    spawning replacement tasks.
>
> If the slave restarts after 10 mins:
> 1. When implicit recon starts, this situation gets fixed because in
>    Aurora it is marked as lost and Mesos sends running, and those get
>    killed and replaced.
>
> So, questions:
> 1. When Mesos sends slave lost after 10 mins in this situation, why does
>    Aurora not act on it?
> 2. As per recon docs best practices, explicit recon should start,
>    followed by implicit recon, on master failover. Looks like Aurora is
>    not doing that and the regular hourly recons are running with a 30 min
>    spread between explicit and implicit. Should Aurora do recon on master
>    failover?
>
> General questions:
> 1. What is the effect on Aurora if we make explicit recon run every
>    15 mins instead of the default 1 hr? Does it slow down scheduling,
>    does snapshot creation get delayed, etc.?
> 2. Any issue if the spread between explicit recon and implicit recon is
>    brought down to 2 mins from 30 mins? Probably depends on 1.
>
> Thx
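
To make the two cases from the quoted message concrete, here is a minimal Python sketch of how a framework could react to reconciliation answers. It is a toy model under stated assumptions, not Aurora's code and not the Mesos API: ToyScheduler, explicit_reconcile, implicit_reconcile and the master_tasks dict are hypothetical names, and answering TASK_LOST for any task the master no longer knows is a simplification.

```python
from dataclasses import dataclass, field

RUNNING, LOST = "TASK_RUNNING", "TASK_LOST"


@dataclass
class ToyScheduler:
    """Toy stand-in for a framework scheduler (hypothetical, not Aurora)."""
    tasks: dict = field(default_factory=dict)       # task_id -> last known state
    to_kill: list = field(default_factory=list)     # tasks we would ask Mesos to kill
    to_replace: list = field(default_factory=list)  # tasks we would reschedule

    def status_update(self, task_id, state):
        known = self.tasks.get(task_id)
        if state == RUNNING and known in (None, LOST):
            # Unknown or "resurrected" task. The reconciliation docs say to
            # kill or recover it; this sketch always kills.
            self.to_kill.append(task_id)
        elif state == LOST and known == RUNNING:
            # The scheduler thought the task was alive: mark it lost and
            # queue a replacement.
            self.tasks[task_id] = LOST
            self.to_replace.append(task_id)


def explicit_reconcile(scheduler, master_tasks):
    """Scheduler asks about every task it still believes is running."""
    for task_id, state in list(scheduler.tasks.items()):
        if state == RUNNING:
            # Assumption: a task the master does not know is answered as lost.
            scheduler.status_update(task_id, master_tasks.get(task_id, LOST))


def implicit_reconcile(scheduler, master_tasks):
    """Empty request: the master reports every task it currently knows."""
    for task_id, state in master_tasks.items():
        scheduler.status_update(task_id, state)


# Case 1: the agent never comes back and no TASK_LOST reached the scheduler.
s1 = ToyScheduler(tasks={"t1": RUNNING})
explicit_reconcile(s1, master_tasks={})
print(s1.to_replace)

# Case 2: the agent reconnects after the timeout; the scheduler already
# holds the task as lost, but the master reports it running again.
s2 = ToyScheduler(tasks={"t1": LOST})
implicit_reconcile(s2, master_tasks={"t1": RUNNING})
print(s2.to_kill)
```

In this model the first case is only repaired by explicit reconciliation and the second only by implicit reconciliation, which matches the behavior described in the quoted message.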