I've left a comment on the initial RB detailing how the change broke backwards-compatibility. Given that the tasks are marked as lost as soon as the agent reregisters after slaveLost is sent anyway, there doesn't seem to be any reason not to send TASK_LOST too. I think this should be an easy fix.
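For reference, here is a rough sketch (not the actual Mesos master code path) of what "send TASK_LOST too" would mean in the markUnreachableAfterFailover path: build a TASK_LOST update for every task known to be on the agent and hand it to the owning framework, mirroring what the non-failover unreachability path already does. forwardToFramework() below is a hypothetical stand-in for the master's internal status-update forwarding; only the protobuf fields are the real ones.

    // Rough sketch only (not actual Mesos master code). forwardToFramework()
    // is a hypothetical stand-in for the master's status-update forwarding.
    #include <vector>

    #include <mesos/mesos.hpp>  // TaskInfo, TaskStatus, TaskState, SlaveID protos

    using namespace mesos;

    // Hypothetical: deliver a master-generated status update to the scheduler.
    void forwardToFramework(const TaskStatus& status);

    void sendTaskLostForUnreachableAgent(
        const SlaveID& slaveId,
        const std::vector<TaskInfo>& tasksOnAgent)
    {
      for (const TaskInfo& task : tasksOnAgent) {
        TaskStatus status;
        status.mutable_task_id()->CopyFrom(task.task_id());
        status.mutable_slave_id()->CopyFrom(slaveId);
        status.set_state(TASK_LOST);
        status.set_source(TaskStatus::SOURCE_MASTER);
        status.set_reason(TaskStatus::REASON_SLAVE_REMOVED);
        status.set_message(
            "Agent did not reregister within the reregistration timeout"
            " after master failover");

        forwardToFramework(status);
      }
    }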
On Sat, Jul 15, 2017 at 9:21 AM, David McLaughlin <dmclaugh...@apache.org> wrote:

> Yes, we've confirmed this internally too (Santhosh did the work here):
>
>> When an agent becomes unreachable while the master is running, it sends TASK_LOST events for each task on the agent. https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L7066-L7107
>>
>> Marking an agent unreachable after failover does not cause TASK_LOST events. https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L2036-L2070
>>
>> Once an agent re-registers, it sends TASK_LOST events: the agent sends TASK_LOST for tasks it does not know about after a master failover. https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/slave/slave.cpp#L1324-L1383
>
> The separate code path for markUnreachableAfterFailover appears to have been added by this commit: https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c
>
> And I think this totally breaks the promise of introducing the PARTITION_AWARE stuff in a backwards-compatible way.
>
> So right now, yes, we rely on reconciliation to finally mark the tasks as LOST and reschedule their replacements.
>
> I think the only reason we haven't been more impacted by this at Twitter is that our Mesos master is remarkably stable (compared to Aurora's daily failovers).
>
> We have two paths forward here: push forward and embrace the new partition-awareness features in Aurora, and/or push back on the above change with the Mesos community so there is a better story for non-partition-aware APIs in the short term.
>
> On Sat, Jul 15, 2017 at 2:01 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>
>> We can reproduce it easily; the steps are:
>>
>> 1. Shut down the leading Mesos master.
>> 2. Shut down an agent at the same time.
>> 3. Wait for 10 minutes.
>>
>> What Renan and I saw in the logs was that only agent lost was sent, not task lost. In the regular health-check-expiry scenario, both task lost and agent lost were sent.
>>
>> So yes, this is very concerning.
>>
>> Thx
>>
>> On Jul 14, 2017, at 10:28 AM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>
>>> It would be interesting to see the logs. I think that will tell you if the Mesos master is:
>>>
>>> a) Sending slaveLost
>>> b) Trying to send TASK_LOST
>>>
>>> And then the scheduler logs (and/or the metrics it exports) should tell you whether those events were received. If this is reproducible, I'd consider it a serious bug.
>>>
>>> On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>
>>>> So in this situation, why is Aurora not replacing the tasks instead of waiting for external recon to fix it?
>>>>
>>>> This is different from when the 75 sec (5*15) health check of the slave times out (no master failover); there Aurora replaces the tasks on the task lost message.
>>>>
>>>> Are you hinting we should ask the Mesos folks why, in the master failover re-registration timeout scenario, task lost is not sent even though slave lost is sent, when per the docs below task lost should have been sent?
>>>>
>>>> Because either Mesos is not sending the right status or Aurora is not handling it.
>>>>
>>>> Thx
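To make that "is it sent vs. is it handled" question answerable, a couple of counters bumped from the scheduler's callbacks is usually enough. A minimal diagnostic sketch (framework-agnostic, not Aurora's actual metrics code) against the standard C++ scheduler driver callbacks:

    // Count what the master actually delivers after a failover + agent loss.
    // Wire these into your Scheduler implementation's slaveLost() and
    // statusUpdate() callbacks; exporting the counters is omitted here.
    #include <atomic>
    #include <iostream>

    #include <mesos/mesos.hpp>  // SlaveID, TaskStatus protos

    static std::atomic<long> slaveLostReceived{0};
    static std::atomic<long> taskLostReceived{0};

    // Call from Scheduler::slaveLost(SchedulerDriver*, const SlaveID&).
    void recordSlaveLost(const mesos::SlaveID& slaveId)
    {
      ++slaveLostReceived;
      std::cerr << "slaveLost received for agent " << slaveId.value() << std::endl;
    }

    // Call from Scheduler::statusUpdate(SchedulerDriver*, const TaskStatus&).
    void recordStatusUpdate(const mesos::TaskStatus& status)
    {
      if (status.state() == mesos::TASK_LOST) {
        ++taskLostReceived;
        std::cerr << "TASK_LOST received for task " << status.task_id().value()
                  << std::endl;
      }
    }

If only the slaveLost counter moves after the reregistration timeout, the master is not sending TASK_LOST; if both move and Aurora still does not reschedule, the handling side is the problem.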
>>>> On Jul 14, 2017, at 8:21 AM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>>>
>>>>> "1. When mesos sends slave lost after 10 mins in this situation, why does aurora not act on it?"
>>>>>
>>>>> Because Mesos also sends TASK_LOST for every task running on the agent whenever it calls slaveLost:
>>>>>
>>>>> When it is time to remove an agent, the master removes the agent from the list of registered agents in the master’s durable state <http://mesos.apache.org/documentation/latest/replicated-log-internals/> (this will survive master failover). The master sends a slaveLost callback to every registered scheduler driver; it also sends TASK_LOST status updates for every task that was running on the removed agent.
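That per-task TASK_LOST is exactly what a non-partition-aware scheduler ends up relying on; when it is not delivered (the failover path above), the framework's only alternative is to act on slaveLost itself. A rough framework-side sketch of that (not Aurora's actual code; tasksByAgent and rescheduleTask are hypothetical stand-ins for the scheduler's own bookkeeping and rescheduling logic):

    // Acting on slaveLost directly: treat every task the framework believes
    // is on the lost agent as lost and reschedule it.
    #include <string>
    #include <unordered_map>
    #include <vector>

    #include <mesos/mesos.hpp>  // SlaveID, TaskID protos

    using TasksByAgent =
        std::unordered_map<std::string, std::vector<mesos::TaskID>>;

    void rescheduleTask(const mesos::TaskID& taskId);  // hypothetical

    // Call from Scheduler::slaveLost(SchedulerDriver*, const SlaveID&).
    void handleSlaveLost(const mesos::SlaveID& slaveId, TasksByAgent& tasksByAgent)
    {
      auto it = tasksByAgent.find(slaveId.value());
      if (it == tasksByAgent.end()) {
        return;  // no tasks believed to be on this agent
      }

      // Without per-task TASK_LOST updates, slaveLost is the only signal that
      // these tasks are gone, so mark them lost and replace them here.
      for (const mesos::TaskID& taskId : it->second) {
        rescheduleTask(taskId);
      }

      tasksByAgent.erase(it);
    }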
>>>>> On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>>
>>>>>> We were investigating slave re-registration behavior on master failover in Aurora 0.17 with Mesos 1.1.
>>>>>>
>>>>>> A few important points:
>>>>>>
>>>>>> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/ ("If an agent does not reregister with the new master within a timeout (controlled by the --agent_reregister_timeout configuration flag), the master marks the agent as failed and follows the same steps described above. However, there is one difference: by default, agents are allowed to reconnect following master failover, even after the agent_reregister_timeout has fired. This means that frameworks might see a TASK_LOST update for a task but then later discover that the task is running (because the agent where it was running was allowed to reconnect).")
>>>>>>
>>>>>> http://mesos.apache.org/documentation/latest/reconciliation/ ("Implicit reconciliation (passing an empty list) should also be used periodically, as a defense against data loss in the framework. Unless a strict registry is in use on the master, it's possible for tasks to resurrect from a LOST state (without a strict registry the master does not enforce agent removal across failovers). When an unknown task is encountered, the scheduler should kill or recover the task.")
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/MESOS-5951 (Removes the strict registry mode flag from 1.1 and reverts to the old behavior of non-strict registry mode, where tasks and executors were not killed on agent re-registration timeout after master failover.)
>>>>>>
>>>>>> So, what we find if the slave does not come back after 10 mins:
>>>>>> 1. The Mesos master sends slave lost but not task lost to Aurora.
>>>>>> 2. Aurora does not replace the tasks.
>>>>>> 3. Only when explicit recon starts does this get corrected, with Aurora spawning replacement tasks.
>>>>>>
>>>>>> If the slave restarts after 10 mins:
>>>>>> 1. When implicit recon starts, the situation gets fixed: the tasks are marked as lost in Aurora, Mesos reports them as running, and they get killed and replaced.
>>>>>>
>>>>>> So, questions:
>>>>>> 1. When Mesos sends slave lost after 10 mins in this situation, why does Aurora not act on it?
>>>>>> 2. As per the recon docs' best practices, explicit recon followed by implicit recon should run on master failover. It looks like Aurora is not doing that; only the regular hourly recons run, with a 30 min spread between explicit and implicit. Should Aurora do recon on master failover?
>>>>>>
>>>>>> General questions:
>>>>>> 1. What is the effect on Aurora if we run explicit recon every 15 mins instead of the default 1 hr? Does it slow down scheduling, does snapshot creation get delayed, etc.?
>>>>>> 2. Any issue if the spread between explicit recon and implicit recon is brought down to 2 mins from 30 mins? Probably depends on 1.
>>>>>>
>>>>>> Thx
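On the reconciliation questions above: triggering an explicit pass right after a master failover (for example from the driver's reregistered callback), followed shortly by an implicit pass, comes down to two reconcileTasks calls. A minimal sketch with the C++ scheduler driver, assuming the framework can enumerate the task/agent IDs it believes are live (liveTasks() is a hypothetical stand-in for that); the 15 min interval and 2 min spread would come from whatever timer infrastructure the scheduler already has:

    // Explicit: ask the master about the tasks we think are running.
    // Implicit: pass an empty list so the master streams back the latest
    // state for all tasks it knows for this framework.
    #include <utility>
    #include <vector>

    #include <mesos/mesos.hpp>
    #include <mesos/scheduler.hpp>  // SchedulerDriver::reconcileTasks()

    // Hypothetical: task/agent IDs the framework currently believes are live.
    std::vector<std::pair<mesos::TaskID, mesos::SlaveID>> liveTasks();

    void explicitReconcile(mesos::SchedulerDriver* driver)
    {
      std::vector<mesos::TaskStatus> statuses;
      for (const auto& entry : liveTasks()) {
        mesos::TaskStatus status;
        status.mutable_task_id()->CopyFrom(entry.first);
        status.mutable_slave_id()->CopyFrom(entry.second);
        status.set_state(mesos::TASK_RUNNING);  // required proto field; the
                                                // master reconciles on the IDs
        statuses.push_back(status);
      }
      driver->reconcileTasks(statuses);
    }

    void implicitReconcile(mesos::SchedulerDriver* driver)
    {
      // Empty list => implicit reconciliation.
      driver->reconcileTasks(std::vector<mesos::TaskStatus>());
    }

Running the explicit pass on failover and every 15 minutes, with the implicit pass 2 minutes later, is then just a matter of where these two calls are hooked into the scheduler's timers.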