Re: Aurora reconciliation and Master fail over

David McLaughlin Fri, 14 Jul 2017 10:28:31 -0700

It would be interesting to see the logs. I think the Mesos master logs will
tell you whether the master is:

a) Sending slaveLost
b) Trying to send TASK_LOST

And then the Scheduler logs (and/or the metrics it exports) should tell you
whether those events were received. If this is reproducible, I'd consider
it a serious bug.
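
The triage described above can be written down as a small decision function.
This is only a sketch: the Observation fields and the triage() name are
hypothetical, and the booleans have to be filled in by hand from whatever the
master logs, scheduler logs and scheduler metrics actually show. It is not
Aurora or Mesos code.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Evidence gathered by hand for one lost agent (hypothetical field names)."""
    master_sent_slave_lost: bool        # from the Mesos master logs
    master_sent_task_lost: bool         # from the Mesos master logs
    scheduler_received_task_lost: bool  # from the scheduler logs or metrics
    tasks_replaced: bool                # did the scheduler start replacements?


def triage(obs: Observation) -> str:
    """Return which side to investigate first, given the evidence."""
    if obs.tasks_replaced:
        return "ok: tasks were replaced"
    if not obs.master_sent_slave_lost:
        return "mesos: agent was never declared lost"
    if not obs.master_sent_task_lost:
        return "mesos: slaveLost sent without TASK_LOST"
    if not obs.scheduler_received_task_lost:
        return "delivery: TASK_LOST sent but not received"
    return "aurora: TASK_LOST received but tasks not replaced"


# The situation reported in this thread: slaveLost seen, no TASK_LOST, no replacement.
reported = Observation(True, False, False, False)
print(triage(reported))
```

For the case reported in the thread this points at the master side first; if
the master logs do show TASK_LOST being sent, the same function points at
delivery or at the scheduler instead.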

On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <
[email protected]> wrote:

> So in this situation, why is Aurora not replacing the tasks, and instead
> waiting for external recon to fix it?
>
> This is different from the case where the 75 sec (5*15) slave health check
> times out (no master failover): there, Aurora replaces the tasks on the
> TASK_LOST message.
>
> Are you hinting that we should ask the Mesos folks why, in the master
> failover reregistration timeout scenario, TASK_LOST is not sent even though
> slaveLost is sent, when per the docs below TASK_LOST should have been sent?
>
> Because either Mesos is not sending the right status or Aurora is not
> handling it.
>
> Thx
>
> > On Jul 14, 2017, at 8:21 AM, David McLaughlin <[email protected]> wrote:
> >
> > "1. When mesos sends slave lost after 10 mins in this situation, why does
> > aurora not act on it?"
> >
> > Because Mesos also sends TASK_LOST for every task running on the agent
> > whenever it calls slaveLost:
> >
> > When it is time to remove an agent, the master removes the agent from the
> > list of registered agents in the master’s durable state
> > <http://mesos.apache.org/documentation/latest/replicated-log-internals/>
> > (this will survive master failover). The master sends a slaveLost callback
> > to every registered scheduler driver; it also sends TASK_LOST status
> > updates for every task that was running on the removed agent.
> >
> >
> >
> >
> > On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <
> > [email protected]> wrote:
> >
> >> We were investigating slave re-registration behavior on master failover
> >> in Aurora 0.17 with Mesos 1.1.
> >> A few important points:
> >> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
> >> (If an agent does not reregister with the new master within a timeout
> >> (controlled by the --agent_reregister_timeout configuration flag), the
> >> master marks the agent as failed and follows the same steps described
> >> above. However, there is one difference: by default, agents are allowed
> >> to reconnect following master failover, even after the
> >> agent_reregister_timeout has fired. This means that frameworks might see
> >> a TASK_LOST update for a task but then later discover that the task is
> >> running (because the agent where it was running was allowed to
> >> reconnect).)
> >>
> >> http://mesos.apache.org/documentation/latest/reconciliation/
> >> (Implicit reconciliation (passing an empty list) should also be used
> >> periodically, as a defense against data loss in the framework. Unless a
> >> strict registry is in use on the master, it's possible for tasks to
> >> resurrect from a LOST state (without a strict registry the master does
> >> not enforce agent removal across failovers). When an unknown task is
> >> encountered, the scheduler should kill or recover the task.)
> >>
> >> https://issues.apache.org/jira/browse/MESOS-5951
> >> (Removes the strict registry mode flag from 1.1 and reverts to the old
> >> behavior of non-strict registry mode, where tasks and executors were not
> >> killed on agent reregistration timeout on master failover.)
> >> So, what we find: if the slave does not come back after 10 mins
> >> 1. Mesos master sends slave lost but not task lost to Aurora.
> >> 2. Aurora does not replace the tasks.
> >> 3. Only when explicit recon starts does this get corrected, with Aurora
> >>    spawning replacement tasks.
> >>
> >> If the slave restarts after 10 mins
> >> 1. When implicit recon starts, this situation gets fixed, because in
> >>    Aurora the tasks are marked as lost and Mesos reports them as running,
> >>    so they get killed and replaced.
> >> So, questions
> >> 1. When mesos sends slave lost after 10 mins in this situation, why does
> >>    aurora not act on it?
> >> 2. As per the recon docs' best practices, explicit recon should start,
> >>    followed by implicit recon, on master failover. It looks like Aurora is
> >>    not doing that, and the regular hourly recons are running with a 30 min
> >>    spread between explicit and implicit. Should Aurora do recon on master
> >>    failover?
> >>
> >> General questions
> >> 1. What is the effect on Aurora if we run explicit recon every 15 mins
> >>    instead of the default 1 hr? Does it slow down scheduling, does
> >>    snapshot creation get delayed, etc.?
> >> 2. Any issue if the spread between explicit recon and implicit recon is
> >>    brought down to 2 mins from 30 mins? Probably depends on 1.
> >> Thx
> >>
> >>
> >>
>
>
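
The timings discussed in the quoted thread (hourly reconciliation with a 30 min
spread versus a 15 min interval with a 2 min spread, the 10 min reregistration
timeout, and the 75 sec health check) can be compared with a little
arithmetic. This is a rough sketch with assumptions: the function names are
made up for illustration, the first explicit run is assumed to happen at
minute 0, each implicit run is assumed to follow an explicit run by exactly
the spread, and the 75 sec figure is assumed to be 5 missed pings of 15 sec
each. It does not model Aurora's actual scheduling code.

```python
def recon_times(interval_min, spread_min, horizon_min):
    """Return (explicit, implicit) run times in minutes within [0, horizon_min)."""
    explicit = list(range(0, horizon_min, interval_min))
    implicit = [t + spread_min for t in explicit if t + spread_min < horizon_min]
    return explicit, implicit


def worst_case_correction_min(reregister_timeout_min, explicit_interval_min):
    """Rough upper bound, in minutes after master failover, until explicit
    reconciliation notices the lost tasks when no TASK_LOST arrives: the agent
    is removed just after an explicit run, so a full interval passes."""
    return reregister_timeout_min + explicit_interval_min


# Health-check timeout without master failover, as quoted in the thread.
HEALTH_CHECK_TIMEOUT_S = 5 * 15

print(recon_times(60, 30, 120))           # ([0, 60], [30, 90])
print(recon_times(15, 2, 60))             # ([0, 15, 30, 45], [2, 17, 32, 47])
print(worst_case_correction_min(10, 60))  # 70
print(worst_case_correction_min(10, 15))  # 25
```

Under these assumptions, shortening the explicit interval from 60 to 15 mins
cuts the rough worst case from 70 to 25 mins, compared with 75 sec when the
health check times out and TASK_LOST is delivered. The sketch says nothing
about the scheduling or snapshot cost of running reconciliation more often.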
