Yup, that looks like the way to go. Going to go ahead and file a ticket on JIRA for this so that we don't forget. Thanks for digging into this, David.
-Renan

On Mon, Jul 17, 2017 at 3:00 PM, David McLaughlin <da...@dmclaughlin.com> wrote:

> Based on the thread in the Mesos dev list, it looks like, because they don't persist task information, they don't have the task IDs to send when they detect the agent is lost during failover. So unless this is changed on the Mesos side, we need to act on the slaveLost message and mark all those tasks as LOST in Aurora.
>
> Or rely on reconciliation. To reconcile more often, you should keep in mind:
>
> 1) Implicit reconciliation sends one message to Mesos, and Mesos replies with N status updates immediately, where N = the number of running tasks. This process is usually quick (on the order of seconds) because the updates are mostly NOOPs. When you have a large number of running tasks (say 100k+), you may see some GC pressure due to the flood of status updates. If this operation overlaps with another particularly expensive operation (like a snapshot), it can cause a huge stop-the-world GC, but it does not otherwise interfere with any operation.
>
> 2) Explicit reconciliation is done in batches: Aurora batches up all running tasks and sends one batch at a time, staggered by some delay. The benefit is less GC pressure, but the drawback is that if you have a lot of running tasks (again, 100k+), it will take over 10 minutes to complete. So you have to make sure your reconciliation interval is aligned with this (you can always increase the batch size to make it finish faster).
>
> Cheers,
> David
>
> On Sun, Jul 16, 2017 at 11:10 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>
>> Got it. Thx!
>>
>> > On Jul 16, 2017, at 9:49 AM, Stephan Erb <m...@stephanerb.eu> wrote:
>> >
>> > Reconciliation in Aurora is not a specific mode. It just runs concurrently with other background work such as snapshots or backups [1].
>> >
>> > Just be aware that we don't have metrics to track the runtime of explicit and implicit reconciliations. If you use settings that are overly aggressive, you might overload Aurora's queue of incoming Mesos status updates (for example).
>> >
>> > [1] https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java
>> >
>> >> On Sat, 2017-07-15 at 22:28 -0700, Meghdoot bhattacharya wrote:
>> >> Thx David for the follow up and confirmation. We have started the thread on the mesos dev DL.
>> >>
>> >> To get clarification on the recon: what, in general, is affected during the recon? Are scheduling and activities like snapshots paused while recon takes place? Trying to see whether to run aggressive recon in the meantime.
>> >>
>> >> Thx
>> >>
>> >>> On Jul 15, 2017, at 9:33 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>> >>>
>> >>> I've left a comment on the initial RB detailing how the change broke backwards-compatibility. Given that the tasks are marked as lost as soon as the agent reregisters after slaveLost is sent anyway, there doesn't seem to be any reason not to send TASK_LOST too. I think this should be an easy fix.
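[As a back-of-the-envelope illustration of the explicit-reconciliation timing David describes at the top of this thread, here is a minimal sketch. The task count, batch size, and stagger delay are made-up illustrative numbers, not Aurora defaults or flags.]

    // Rough estimate of how long one explicit reconciliation pass takes:
    // batches are sent sequentially, staggered by a fixed delay.
    public final class ReconEstimate {
      public static void main(String[] args) {
        int runningTasks = 100_000;        // illustrative fleet size
        int batchSize = 1_000;             // illustrative batch size
        long delayBetweenBatchesSecs = 5;  // illustrative stagger between batches

        long batches = (runningTasks + batchSize - 1) / batchSize;  // ceiling division
        long totalSecs = batches * delayBetweenBatchesSecs;

        // With these assumed numbers: 100 batches * 5s = 500s (~8 minutes), so the
        // explicit reconciliation interval must stay comfortably above the time a
        // full pass takes, or batches from consecutive passes start to overlap.
        System.out.printf("%d batches, ~%d minutes%n", batches, totalSecs / 60);
      }
    }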
>> >>>
>> >>> On Sat, Jul 15, 2017 at 9:21 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>> >>>
>> >>>> Yes, we've confirmed this internally too (Santhosh did the work here):
>> >>>>
>> >>>>> When an agent becomes unreachable while the master is running, it sends TASK_LOST events for each task on the agent.
>> >>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L7066-L7107
>> >>>>>
>> >>>>> Marking an agent unreachable after failover does not cause TASK_LOST events.
>> >>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L2036-L2070
>> >>>>>
>> >>>>> Once an agent re-registers, it sends TASK_LOST events: the agent sends TASK_LOST for tasks that it does not know about after a master failover.
>> >>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/slave/slave.cpp#L1324-L1383
>> >>>>
>> >>>> The separate code path for markUnreachableAfterFailover appears to have been added by this commit:
>> >>>> https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c
>> >>>>
>> >>>> And I think this totally breaks the promise of introducing the PARTITION_AWARE stuff in a backwards-compatible way.
>> >>>>
>> >>>> So right now, yes, we rely on reconciliation to finally mark the tasks as LOST and reschedule their replacements.
>> >>>>
>> >>>> I think the only reason we haven't been more impacted by this at Twitter is that our Mesos master is remarkably stable (compared to Aurora's daily failovers).
>> >>>>
>> >>>> We have two paths forward here: push forward and embrace the new partition-awareness features in Aurora, and/or push back on the above change with the Mesos community and have a better story for non-partition-aware APIs in the short term.
>> >>>>
>> >>>> On Sat, Jul 15, 2017 at 2:01 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>> >>>>
>> >>>>> We can reproduce it easily; the steps are:
>> >>>>> 1. Shut down the leading mesos master
>> >>>>> 2. Shut down an agent at the same time
>> >>>>> 3. Wait for 10 mins
>> >>>>>
>> >>>>> What Renan and I saw in the logs was only agent lost sent, not task lost. In the regular health-check expiry scenario, both task lost and agent lost were sent.
>> >>>>>
>> >>>>> So yes, this is very concerning.
>> >>>>>
>> >>>>> Thx
>> >>>>>
>> >>>>>> On Jul 14, 2017, at 10:28 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>> >>>>>>
>> >>>>>> It would be interesting to see the logs. I think that will tell you if the Mesos master is:
>> >>>>>>
>> >>>>>> a) Sending slaveLost
>> >>>>>> b) Trying to send TASK_LOST
>> >>>>>>
>> >>>>>> And then the Scheduler logs (and/or the metrics it exports) should tell you whether those events were received. If this is reproducible, I'd consider it a serious bug.
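[A minimal sketch of the workaround discussed earlier in the thread (acting on the slaveLost callback and marking the affected tasks LOST instead of waiting for reconciliation), assuming the Mesos Java scheduler API. The TaskStore type and its methods here are hypothetical stand-ins, not Aurora's actual storage interfaces.]

    import java.util.Set;
    import org.apache.mesos.Protos;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;

    // Sketch: when the master reports an agent as lost, transition every task the
    // scheduler believes is on that agent to LOST rather than waiting for the next
    // reconciliation pass. TaskStore is a hypothetical stand-in for the scheduler's
    // own task storage.
    abstract class SlaveLostAwareScheduler implements Scheduler {
      private final TaskStore taskStore;  // hypothetical task storage

      SlaveLostAwareScheduler(TaskStore taskStore) {
        this.taskStore = taskStore;
      }

      @Override
      public void slaveLost(SchedulerDriver driver, Protos.SlaveID slaveId) {
        // Mesos may not send TASK_LOST for these tasks after a master failover,
        // so mark them lost locally and let normal rescheduling replace them.
        Set<String> taskIds = taskStore.fetchTaskIdsOnAgent(slaveId.getValue());
        for (String taskId : taskIds) {
          taskStore.markLost(taskId, "Agent " + slaveId.getValue() + " was lost");
        }
      }

      interface TaskStore {  // hypothetical
        Set<String> fetchTaskIdsOnAgent(String agentId);
        void markLost(String taskId, String reason);
      }
    }

[The trade-off is exactly the one the docs quoted below describe: an agent may be allowed to reconnect after failover, so a task marked LOST this way can later turn out to be running and would then need to be killed or recovered.]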
>> >>>>>>
>> >>>>>> On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>> >>>>>>
>> >>>>>>> So in this situation, why is Aurora not replacing the tasks instead of waiting for external recon to fix it?
>> >>>>>>>
>> >>>>>>> This is different from when the 75 sec (5*15) health check of the slave times out (no master failover); there, Aurora replaces it on the task lost message.
>> >>>>>>>
>> >>>>>>> Are you hinting that we should ask the mesos folks why, in the master-failover reregistration-timeout scenario, task lost is not sent even though slave lost is sent, when per the docs below task lost should have been sent?
>> >>>>>>>
>> >>>>>>> Because either mesos is not sending the right status or aurora is not handling it.
>> >>>>>>>
>> >>>>>>> Thx
>> >>>>>>>
>> >>>>>>>> On Jul 14, 2017, at 8:21 AM, David McLaughlin <dmclaughlin...@apache.org> wrote:
>> >>>>>>>>
>> >>>>>>>> "1. When mesos sends slave lost after 10 mins in this situation, why does aurora not act on it?"
>> >>>>>>>>
>> >>>>>>>> Because Mesos also sends TASK_LOST for every task running on the agent whenever it calls slaveLost:
>> >>>>>>>>
>> >>>>>>>> When it is time to remove an agent, the master removes the agent from the list of registered agents in the master’s durable state <http://mesos.apache.org/documentation/latest/replicated-log-internals/> (this will survive master failover). The master sends a slaveLost callback to every registered scheduler driver; it also sends TASK_LOST status updates for every task that was running on the removed agent.
>> >>>>>>>>
>> >>>>>>>> On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>> >>>>>>>>
>> >>>>>>>>> We were investigating slave re-registration behavior on master failover in Aurora 0.17 with mesos 1.1.
>> >>>>>>>>>
>> >>>>>>>>> A few important points:
>> >>>>>>>>>
>> >>>>>>>>> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/ (If an agent does not reregister with the new master within a timeout (controlled by the --agent_reregister_timeout configuration flag), the master marks the agent as failed and follows the same steps described above. However, there is one difference: by default, agents are allowed to reconnect following master failover, even after the agent_reregister_timeout has fired. This means that frameworks might see a TASK_LOST update for a task but then later discover that the task is running (because the agent where it was running was allowed to reconnect).)
>> >>>>>>>>> http://mesos.apache.org/documentation/latest/reconciliation/ (Implicit reconciliation (passing an empty list) should also be used periodically, as a defense against data loss in the framework. Unless a strict registry is in use on the master, it's possible for tasks to resurrect from a LOST state (without a strict registry the master does not enforce agent removal across failovers). When an unknown task is encountered, the scheduler should kill or recover the task.)
>> >>>>>>>>>
>> >>>>>>>>> https://issues.apache.org/jira/browse/MESOS-5951 (Removes the strict registry mode flag from 1.1 and reverts to the old behavior of non-strict registry mode, where tasks and executors were not killed on agent reregistration timeout on master failover.)
>> >>>>>>>>>
>> >>>>>>>>> So, what we find if the slave does not come back after 10 mins:
>> >>>>>>>>> 1. The Mesos master sends slave lost but not task lost to Aurora.
>> >>>>>>>>> 2. Aurora does not replace the tasks.
>> >>>>>>>>> 3. Only when explicit recon starts does this get corrected, with Aurora spawning replacement tasks.
>> >>>>>>>>>
>> >>>>>>>>> If the slave restarts after 10 mins:
>> >>>>>>>>> 1. When implicit recon starts, the situation gets fixed: in Aurora the tasks are marked as lost, mesos reports them as running, and those get killed and replaced.
>> >>>>>>>>>
>> >>>>>>>>> So, questions:
>> >>>>>>>>> 1. When mesos sends slave lost after 10 mins in this situation, why does Aurora not act on it?
>> >>>>>>>>> 2. As per the recon docs best practices, explicit recon should start, followed by implicit recon, on master failover. It looks like Aurora is not doing that; the regular hourly recons are running with a 30 min spread between explicit and implicit. Should Aurora do recon on master failover?
>> >>>>>>>>>
>> >>>>>>>>> General questions:
>> >>>>>>>>> 1. What is the effect on Aurora if we run explicit recon every 15 mins instead of the default 1 hr? Does it slow down scheduling, does snapshot creation get delayed, etc.?
>> >>>>>>>>> 2. Any issue if the spread between explicit recon and implicit recon is brought down to 2 mins from 30 mins? Probably depends on 1.
>> >>>>>>>>>
>> >>>>>>>>> Thx
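[For reference, a minimal sketch of the "explicit recon followed by implicit recon" sequence these questions refer to, using the Mesos SchedulerDriver API. How the known task IDs are fetched and how the two calls are spaced out is left out; Aurora staggers them by a configurable spread and batches the explicit half, as described earlier in the thread.]

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Collections;
    import org.apache.mesos.Protos;
    import org.apache.mesos.SchedulerDriver;

    // Sketch of explicit reconciliation (ask about every task we think is running)
    // followed by implicit reconciliation (ask the master about everything it knows).
    final class ReconcileOnFailover {
      static void reconcile(SchedulerDriver driver, Collection<String> knownTaskIds) {
        // Explicit reconciliation: one TaskStatus per task the scheduler believes exists.
        Collection<Protos.TaskStatus> statuses = new ArrayList<>();
        for (String taskId : knownTaskIds) {
          statuses.add(Protos.TaskStatus.newBuilder()
              .setTaskId(Protos.TaskID.newBuilder().setValue(taskId))
              .setState(Protos.TaskState.TASK_RUNNING)  // latest known state
              .build());
        }
        driver.reconcileTasks(statuses);

        // Implicit reconciliation: an empty list asks the master to report every task
        // it knows about, catching tasks the scheduler has lost track of.
        driver.reconcileTasks(Collections.<Protos.TaskStatus>emptyList());
      }
    }

[In practice the two calls are not back-to-back: the explicit pass is sent in staggered batches and the implicit pass is spread some interval after it, which is exactly the spread being discussed above.]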