Aurora reconciliation and Master fail over

meghdoot bhattacharya Thu, 13 Jul 2017 16:33:06 -0700

We were investigating slave re-registration behavior on master failover in 
Aurora 0.17 with Mesos 1.1.
A few important points from the docs:
http://mesos.apache.org/documentation/latest/high-availability-framework-guide/ 
(If an agent does not reregister with the new master within a timeout 
(controlled by the --agent_reregister_timeout configuration flag), the master 
marks the agent as failed and follows the same steps described above. However, 
there is one difference: by default, agents are allowed to reconnect following 
master failover, even after the agent_reregister_timeout has fired. This means 
that frameworks might see a TASK_LOST update for a task but then later discover 
that the task is running (because the agent where it was running was allowed to 
reconnect).)
http://mesos.apache.org/documentation/latest/reconciliation/ 
(Implicit reconciliation (passing an empty list) should also be used 
periodically, as a defense against data loss in the framework. Unless a strict 
registry is in use on the master, it's possible for tasks to resurrect from a 
LOST state (without a strict registry the master does not enforce agent removal 
across failovers). When an unknown task is encountered, the scheduler should 
kill or recover the task.)
https://issues.apache.org/jira/browse/MESOS-5951 
(Removes strict registry mode flag from 1.1 and reverts to the old behavior of 
non-strict registry mode where tasks and executors were not killed on agent 
reregistration timeout on master failover)
So, what we found: if the slave does not come back after 10 mins
1. Mesos master sends slave lost but not task lost to Aurora.
2. Aurora does not replace the tasks.
3. Only when explicit recon starts does this get corrected, with Aurora 
spawning replacement tasks.
If the slave restarts after 10 mins
1. When implicit recon starts, this situation gets fixed: Aurora has the tasks 
marked as lost, Mesos reports them as running, and those tasks get killed and 
replaced.
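
To make the two cases above concrete, here is a small toy model of the decision a scheduler makes when a reconciliation status arrives. It is only a sketch of the behavior we observed, not Aurora's actual code; the function name, the Action values and the state strings are made up for illustration.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                          # views agree, nothing to do
    REPLACE = "replace"                    # spawn a replacement task
    KILL_AND_REPLACE = "kill_and_replace"  # kill the resurrected task, replace it


def reconcile(scheduler_state, master_state):
    """Toy decision table (hypothetical, not Aurora's implementation).

    scheduler_state: what the scheduler has stored ("RUNNING" or "LOST")
    master_state:    what the master reports ("TASK_RUNNING" or "TASK_LOST")
    """
    # Case 1: slave never came back. Explicit recon makes the master
    # answer TASK_LOST for a task the scheduler still believes is running.
    if scheduler_state == "RUNNING" and master_state == "TASK_LOST":
        return Action.REPLACE
    # Case 2: slave came back late. Implicit recon reports a task as
    # running that the scheduler has already marked lost.
    if scheduler_state == "LOST" and master_state == "TASK_RUNNING":
        return Action.KILL_AND_REPLACE
    return Action.NONE
```

For example, reconcile("LOST", "TASK_RUNNING") gives Action.KILL_AND_REPLACE, which matches what we saw once implicit recon ran.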
So, questions
1. When Mesos sends slave lost after 10 mins in this situation, why does 
Aurora not act on it?
2. As per the recon docs' best practices, explicit recon should start, 
followed by implicit recon, on master failover. It looks like Aurora is not 
doing that, and the regular hourly recons keep running with a 30 min spread 
between explicit and implicit. Should Aurora do recon on master failover?
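
What question 2 is asking for could look roughly like the sketch below: on master failover, first an explicit recon over the tasks the scheduler knows, then an implicit recon with an empty list. RecordingDriver and reconcile_tasks are hypothetical stand-ins so the example runs on its own; they are not the Mesos or Aurora API, and a real scheduler would probably wait for the explicit responses before starting the implicit pass.

```python
class RecordingDriver:
    """Hypothetical stand-in for a scheduler driver; it only records calls."""

    def __init__(self):
        self.calls = []

    def reconcile_tasks(self, task_ids):
        # A real driver would send a reconciliation request to the master.
        self.calls.append(list(task_ids))


def on_master_failover(driver, known_task_ids):
    """Sketch: run explicit recon, then implicit recon, after a failover."""
    # Explicit: ask about every non-terminal task the scheduler knows of.
    driver.reconcile_tasks(sorted(known_task_ids))
    # Implicit: an empty list asks the master for every task it knows of
    # for this framework, which surfaces resurrected tasks.
    driver.reconcile_tasks([])


driver = RecordingDriver()
on_master_failover(driver, {"task-2", "task-1"})
print(driver.calls)
```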


General questions
1. What is the effect on Aurora if we run explicit recon every 15 mins instead 
of the default 1 hr? Does it slow down scheduling, does snapshot creation get 
delayed, etc.?
2. Any issue if the spread between explicit recon and implicit recon is brought 
down to 2 mins from 30 mins? This probably depends on 1.
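
For a sense of what the two settings mean for recovery time, here is a rough bit of arithmetic. It assumes explicit recon fires every interval starting at minute 0 and implicit recon fires one spread later each time; that is my reading of the current behavior, and the function and parameter names are made up, not Aurora flags. It says nothing about the load on the scheduler, which is what the questions above are about.

```python
def recon_times(interval_min, spread_min, horizon_min):
    """Return (explicit, implicit) run times in minutes up to horizon_min.

    Assumption: implicit recon runs spread_min after each explicit run.
    """
    explicit = list(range(0, horizon_min + 1, interval_min))
    implicit = [t + spread_min for t in explicit if t + spread_min <= horizon_min]
    return explicit, implicit


def next_run_after(event_min, run_times):
    """First scheduled run at or after event_min, or None if there is none."""
    return next((t for t in run_times if t >= event_min), None)


# Slave lost arrives at minute 10 of the cycle in both setups.
explicit, implicit = recon_times(60, 30, 120)
print(next_run_after(10, explicit) - 10)  # wait with hourly recon: 50

explicit, implicit = recon_times(15, 2, 120)
print(next_run_after(10, explicit) - 10)  # wait with 15 min recon: 5
```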
Thx
