Re: Aurora reconciliation and Master fail over

David McLaughlin Fri, 14 Jul 2017 08:21:54 -0700

"1. When Mesos sends slave lost after 10 mins in this situation, why does
Aurora not act on it?"


Because Mesos also sends TASK_LOST for every task running on the agent
whenever it calls slaveLost:

When it is time to remove an agent, the master removes the agent from the
list of registered agents in the master’s durable state
<http://mesos.apache.org/documentation/latest/replicated-log-internals/> (this
will survive master failover). The master sends a slaveLost callback to
every registered scheduler driver; it also sends TASK_LOST status updates
for every task that was running on the removed agent.
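
As a toy illustration of the behaviour quoted above (this is not Mesos or
Aurora code; the class, function, and ID names are all hypothetical, and it
assumes a single framework owns every task on the agent), removing an agent
produces one slaveLost callback plus one TASK_LOST update per task, so a
scheduler that reacts to TASK_LOST has nothing extra to do for slaveLost:

```python
from dataclasses import dataclass, field


@dataclass
class RecordingScheduler:
    """Hypothetical stand-in for a framework scheduler; it only records calls."""
    lost_agents: list = field(default_factory=list)
    status_updates: list = field(default_factory=list)

    def slave_lost(self, agent_id):
        self.lost_agents.append(agent_id)

    def status_update(self, task_id, state):
        self.status_updates.append((task_id, state))


def remove_agent(agent_id, tasks_by_agent, scheduler):
    """Toy model of agent removal as described in the quoted documentation.

    Drops the agent from the registry, sends one slaveLost callback, and
    sends a TASK_LOST update for every task that was running on the agent.
    """
    lost_tasks = tasks_by_agent.pop(agent_id, [])
    scheduler.slave_lost(agent_id)
    for task_id in lost_tasks:
        scheduler.status_update(task_id, "TASK_LOST")
    return lost_tasks


scheduler = RecordingScheduler()
registry = {"agent-1": ["task-a", "task-b"], "agent-2": ["task-c"]}
remove_agent("agent-1", registry, scheduler)
```

In this model the replacement work would hang off status_update, which is
why the slaveLost callback on its own carries no additional action.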




On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <
meghdoo...@yahoo.com.invalid> wrote:

> We were investigating slave re-registration behavior on master failover
> in Aurora 0.17 with Mesos 1.1.
> A few important points:
> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
> (If an agent does not reregister with the
> new master within a timeout (controlled by the --agent_reregister_timeout
> configuration flag), the master marks the agent as failed and follows the
> same steps described above. However, there is one difference: by default,
> agents are allowed to reconnect following master failover, even after the
> agent_reregister_timeout has fired. This means that frameworks might see a
> TASK_LOST update for a task but then later discover that the task is
> running (because the agent where it was running was allowed to reconnect).)
> http://mesos.apache.org/documentation/latest/reconciliation/
> (Implicit reconciliation (passing an empty list) should also be used
> periodically, as a defense against data loss in the framework. Unless a
> strict registry is in use on the master, it's possible for tasks to
> resurrect from a LOST state (without a strict registry the master does not
> enforce agent removal across failovers). When an unknown task is
> encountered, the scheduler should kill or recover the task.)
> https://issues.apache.org/jira/browse/MESOS-5951
> (Removes the strict registry mode flag from 1.1 and reverts to the old
> behavior of non-strict registry mode, where tasks and executors were not
> killed on agent reregistration timeout on master failover.)
> So, what we find:
> If the slave does not come back after 10 mins:
> 1. Mesos master sends slave lost but not task lost to Aurora.
> 2. Aurora does not replace the tasks.
> 3. Only when explicit recon starts does this get corrected, with Aurora
> spawning replacement tasks.
> If the slave restarts after 10 mins:
> 1. When implicit recon starts, this situation gets fixed, because in Aurora
> the tasks are marked as lost while Mesos reports them as running, so they
> get killed and replaced.
> So, questions:
> 1. When Mesos sends slave lost after 10 mins in this situation, why does
> Aurora not act on it?
> 2. As per the recon docs' best practices, explicit recon should start,
> followed by implicit recon, on master failover. It looks like Aurora is
> not doing that, and the regular hourly recons are running with a 30 min
> spread between explicit and implicit. Should Aurora do recon on master
> failover?
>
> General questions:
> 1. What is the effect on Aurora if we run explicit recon every 15 mins
> instead of the default 1 hr? Does it slow down scheduling, does snapshot
> creation get delayed, etc.?
> 2. Any issue if the spread between explicit recon and implicit recon is
> brought down to 2 mins from 30 mins? This probably depends on 1.
> Thx
>
>
>
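
To make the timing in the general questions concrete, here is a rough
arithmetic sketch. It is not Aurora code and uses no Aurora flags; the
function name and parameters are hypothetical, and it assumes explicit
reconciliation fires at every multiple of the interval with implicit
reconciliation following each explicit run by the spread. Aurora's real
scheduling (initial delays, behavior after scheduler restart) may differ.

```python
def recon_schedule(interval_mins, spread_mins, horizon_mins):
    """Return sorted (minute, kind) pairs up to horizon_mins.

    Assumption: explicit recon runs at interval, 2*interval, ... and each
    implicit recon runs spread_mins after the explicit run it follows.
    """
    events = []
    t = interval_mins
    while t <= horizon_mins:
        events.append((t, "explicit"))
        if t + spread_mins <= horizon_mins:
            events.append((t + spread_mins, "implicit"))
        t += interval_mins
    return sorted(events)


# Settings described in the thread: hourly explicit recon, 30 min spread.
current = recon_schedule(60, 30, 120)

# Settings being asked about: explicit recon every 15 mins, 2 min spread.
proposed = recon_schedule(15, 2, 60)
```

Under these assumptions a lost agent whose tasks are only corrected by
explicit recon could wait up to roughly one full interval, which is the gap
the 15 min setting would shrink, at the cost of four times as many explicit
runs per hour.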
