So in this situation, why is Aurora not replacing the tasks instead of waiting for external recon to fix it?

This is different from the case where the 75 sec (5*15) health check of the slave times out (no master failover): there, Aurora replaces the task on the TASK_LOST message. Are you hinting we should ask the Mesos folks why, in the master-failover re-registration timeout scenario, TASK_LOST is not sent even though slave lost is, when per the docs quoted below TASK_LOST should have been sent? Because either Mesos is not sending the right status or Aurora is not handling it.

Thx
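For reference, a minimal sketch of the two scheduler callbacks in question, against the org.apache.mesos Java API. The class name, structure, and comments are mine; this is only an illustration of the callback shapes, not Aurora's actual handler code.

    import org.apache.mesos.Protos.SlaveID;
    import org.apache.mesos.Protos.TaskState;
    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;

    // Sketch only: a real Scheduler implementation must override all of the
    // org.apache.mesos.Scheduler callbacks; only the two relevant ones are shown.
    public abstract class LostHandlingScheduler implements Scheduler {

      @Override
      public void slaveLost(SchedulerDriver driver, SlaveID slaveId) {
        // Fired once per removed agent. By itself it does not identify which
        // tasks were on the agent, which is why a framework would normally rely
        // on the per-task TASK_LOST updates (or on reconciliation) to replace work.
      }

      @Override
      public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
        if (status.getState() == TaskState.TASK_LOST) {
          // This is the path that would trigger a replacement task. The question
          // in this thread is why these updates are not observed when the agent
          // re-registration timeout fires after a master failover.
        }
      }
    }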
> On Jul 14, 2017, at 8:21 AM, David McLaughlin <dmclaugh...@apache.org> wrote:
>
> "1. When mesos sends slave lost after 10 mins in this situation, why does
> aurora not act on it?"
>
> Because Mesos also sends TASK_LOST for every task running on the agent
> whenever it calls slaveLost:
>
> When it is time to remove an agent, the master removes the agent from the
> list of registered agents in the master's durable state
> <http://mesos.apache.org/documentation/latest/replicated-log-internals/>
> (this will survive master failover). The master sends a slaveLost callback
> to every registered scheduler driver; it also sends TASK_LOST status
> updates for every task that was running on the removed agent.
>
>
> On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <
> meghdoo...@yahoo.com.invalid> wrote:
>
>> We were investigating slave re-registration behavior on master failover
>> in Aurora 0.17 with Mesos 1.1.
>>
>> A few important points:
>>
>> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
>> (If an agent does not reregister with the new master within a timeout
>> (controlled by the --agent_reregister_timeout configuration flag), the
>> master marks the agent as failed and follows the same steps described
>> above. However, there is one difference: by default, agents are allowed
>> to reconnect following master failover, even after the
>> agent_reregister_timeout has fired. This means that frameworks might see
>> a TASK_LOST update for a task but then later discover that the task is
>> running (because the agent where it was running was allowed to
>> reconnect).)
>>
>> http://mesos.apache.org/documentation/latest/reconciliation/
>> (Implicit reconciliation (passing an empty list) should also be used
>> periodically, as a defense against data loss in the framework. Unless a
>> strict registry is in use on the master, it's possible for tasks to
>> resurrect from a LOST state (without a strict registry the master does
>> not enforce agent removal across failovers). When an unknown task is
>> encountered, the scheduler should kill or recover the task.)
>>
>> https://issues.apache.org/jira/browse/MESOS-5951
>> (Removes the strict registry mode flag from 1.1 and reverts to the old
>> behavior of non-strict registry mode, where tasks and executors were not
>> killed on agent re-registration timeout after master failover.)
>>
>> So, what we find if the slave does not come back after 10 mins:
>> 1. Mesos master sends slave lost but not task lost to Aurora.
>> 2. Aurora does not replace the tasks.
>> 3. Only when explicit recon starts does this get corrected, with Aurora
>> spawning replacement tasks.
>>
>> If the slave restarts after 10 mins:
>> 1. When implicit recon starts, this situation gets fixed, because in
>> Aurora the tasks are marked as lost and Mesos sends running, so those get
>> killed and replaced.
>>
>> So, questions:
>> 1. When Mesos sends slave lost after 10 mins in this situation, why does
>> Aurora not act on it?
>> 2. As per the recon docs' best practices, explicit recon should start
>> followed by implicit recon on master failover.
>> Looks like Aurora is not doing that; the regular hourly recons are
>> running, with a 30 min spread between explicit and implicit. Should
>> Aurora do recon on master failover?
>>
>> General questions:
>> 1. What is the effect on Aurora if we do explicit recon every 15 mins
>> instead of the default 1 hr? Does it slow down scheduling, does snapshot
>> creation get delayed, etc.?
>> 2. Any issue if the spread between explicit recon and implicit recon is
>> brought down to 2 mins from 30 mins? Probably depends on 1.
>>
>> Thx
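On the explicit vs. implicit reconciliation point discussed above, a rough sketch of how the two calls differ on the SchedulerDriver API. The class and method names are placeholders of my own; this is not how Aurora wires up its reconciler internally.

    import java.util.Collection;
    import java.util.Collections;
    import java.util.List;
    import java.util.stream.Collectors;

    import org.apache.mesos.Protos.TaskID;
    import org.apache.mesos.Protos.TaskState;
    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.SchedulerDriver;

    public final class ReconciliationSketch {

      // Explicit reconciliation: ask the master for the latest state of
      // specific task IDs the framework already knows about.
      static void reconcileExplicitly(SchedulerDriver driver, List<TaskID> knownTaskIds) {
        Collection<TaskStatus> statuses = knownTaskIds.stream()
            .map(id -> TaskStatus.newBuilder()
                .setTaskId(id)
                // The proto requires a state; the master replies with its own view.
                .setState(TaskState.TASK_RUNNING)
                .build())
            .collect(Collectors.toList());
        driver.reconcileTasks(statuses);
      }

      // Implicit reconciliation: an empty list asks the master to report
      // every task it knows about for this framework.
      static void reconcileImplicitly(SchedulerDriver driver) {
        driver.reconcileTasks(Collections.emptyList());
      }
    }

Explicit reconciliation only covers task IDs the framework already tracks, which is why the Mesos docs quoted above also recommend periodic implicit reconciliation as a defense against tasks the framework has lost state for.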
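And purely to illustrate the interval/spread knobs asked about in the general questions (explicit recon every 15 mins instead of 1 hr, spread of 2 mins instead of 30), a hypothetical scheduling sketch that reuses the helpers above. The numbers and the executor-based scheduling are mine, not Aurora's reconciler implementation; they are just a picture of what changing those two settings would mean.

    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.apache.mesos.Protos.TaskID;
    import org.apache.mesos.SchedulerDriver;

    public final class ReconciliationSchedule {

      // Hypothetical values matching the questions above: explicit recon every
      // 15 minutes instead of the default hour, implicit recon 2 minutes after
      // each explicit pass instead of 30 minutes.
      private static final long EXPLICIT_INTERVAL_MIN = 15;
      private static final long SPREAD_MIN = 2;

      static void start(SchedulerDriver driver, List<TaskID> knownTaskIds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

        // Explicit reconciliation on a fixed period (reuses the sketch above).
        executor.scheduleAtFixedRate(
            () -> ReconciliationSketch.reconcileExplicitly(driver, knownTaskIds),
            0, EXPLICIT_INTERVAL_MIN, TimeUnit.MINUTES);

        // Implicit reconciliation on the same period, offset by the spread.
        executor.scheduleAtFixedRate(
            () -> ReconciliationSketch.reconcileImplicitly(driver),
            SPREAD_MIN, EXPLICIT_INTERVAL_MIN, TimeUnit.MINUTES);
      }
    }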