Got it. Thx!
> On Jul 16, 2017, at 9:49 AM, Stephan Erb <m...@stephanerb.eu> wrote:
>
> Reconciliation in Aurora is not a specific mode. It just runs
> concurrently with other background work such as snapshots or backups [1].
>
>
> Just be aware that we don't have metrics to track the runtime of
> explicit and implicit reconciliations. If you use settings that are
> overly aggressive, you might overload Aurora's queue of incoming Mesos
> status updates (for example).
>
> [1] https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java
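>
> For illustration only, here is a rough sketch of how the two reconciliation
> flavours map onto the Mesos scheduler driver. This is not Aurora's actual
> implementation (that lives in the TaskReconciler linked in [1]); the class
> and scheduling code are made up, and the hourly cadence with a 30 minute
> spread just mirrors the defaults mentioned further down this thread:
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.ScheduledExecutorService;
>     import java.util.concurrent.TimeUnit;
>
>     import org.apache.mesos.Protos.TaskStatus;
>     import org.apache.mesos.SchedulerDriver;
>
>     /** Hypothetical sketch of periodic explicit + implicit reconciliation. */
>     class ReconciliationSketch {
>       private final ScheduledExecutorService executor =
>           Executors.newSingleThreadScheduledExecutor();
>
>       void start(SchedulerDriver driver, List<TaskStatus> knownNonTerminalTasks) {
>         // Explicit: ask the master specifically about every task we believe
>         // is still active; it answers with the latest state for each.
>         executor.scheduleAtFixedRate(
>             () -> driver.reconcileTasks(new ArrayList<>(knownNonTerminalTasks)),
>             0, 60, TimeUnit.MINUTES);
>
>         // Implicit: an empty list asks the master for everything it knows,
>         // offset from the explicit run so the two don't flood the incoming
>         // status update queue at the same time.
>         executor.scheduleAtFixedRate(
>             () -> driver.reconcileTasks(new ArrayList<TaskStatus>()),
>             30, 60, TimeUnit.MINUTES);
>       }
>     }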
>
>
>> On Sat, 2017-07-15 at 22:28 -0700, Meghdoot bhattacharya wrote:
>> Thx David for the follow up and confirmation.
>> We have started the thread on the mesos dev DL.
>>
>> So to clarify what happens during recon: are scheduling and activities
>> like snapshots paused while recon takes place? Trying to decide whether
>> to run aggressive recon in the meantime.
>>
>> Thx
>>
>>> On Jul 15, 2017, at 9:33 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>>>
>>> I've left a comment on the initial RB detailing how the change broke
>>> backwards-compatibility. Given that the tasks are marked as lost as soon
>>> as the agent reregisters after slaveLost is sent anyway, there doesn't
>>> seem to be any reason not to send TASK_LOST too. I think this should be
>>> an easy fix.
>>>
>>> On Sat, Jul 15, 2017 at 9:21 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>>>
>>>> Yes, we've confirmed this internally too (Santhosh did the work here):
>>>>
>>>>> When an agent becomes unreachable while the master is running, the
>>>>> master sends TASK_LOST events for each task on the agent.
>>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L7066-L7107
>>>>> Marking an agent unreachable after failover does not cause TASK_LOST
>>>>> events.
>>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L2036-L2070
>>>>> Once an agent re-registers after a master failover, it sends TASK_LOST
>>>>> events for tasks that it does not know about.
>>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/slave/slave.cpp#L1324-L1383
>>>>
>>>>
>>>>
>>>> The separate code path for markUnreachableAfterFailover appears to have
>>>> been added by this commit:
>>>> https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c
>>>>
>>>> And I think this totally breaks the promise of introducing the
>>>> PARTITION_AWARE stuff in a backwards-compatible way.
>>>>
>>>> So right now, yes, we rely on reconciliation to finally mark the tasks
>>>> as LOST and reschedule their replacements.
>>>>
>>>> I think the only reason we haven't been more impacted by this at
>>>> Twitter is that our Mesos master is remarkably stable (compared to
>>>> Aurora's daily failovers).
>>>>
>>>> We have two paths forward here: push forward and embrace the new
>>>> partition awareness features in Aurora, and/or push back on the above
>>>> change with the Mesos community and have a better story for
>>>> non-partition-aware APIs in the short term.
>>>>
>>>>
>>>>
>>>> On Sat, Jul 15, 2017 at 2:01 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>
>>>>> We can reproduce it easily; the steps are:
>>>>> 1. Shut down the leading mesos master
>>>>> 2. Shut down an agent at the same time
>>>>> 3. Wait for 10 mins
>>>>>
>>>>> What Renan and I saw in the logs was only agent lost sent, not task
>>>>> lost, while in the regular health check expiry scenario both task lost
>>>>> and agent lost were sent.
>>>>>
>>>>> So yes this is very concerning.
>>>>>
>>>>> Thx
>>>>>
>>>>>> On Jul 14, 2017, at 10:28 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>>>>>>
>>>>>> It would be interesting to see the logs. I think that will tell you
>>>>>> if the Mesos master is:
>>>>>>
>>>>>> a) Sending slaveLost
>>>>>> b) Trying to send TASK_LOST
>>>>>>
>>>>>> And then the Scheduler logs (and/or the metrics it exports) should
>>>>>> tell you whether those events were received. If this is reproducible,
>>>>>> I'd consider it a serious bug.
>>>>>>
>>>>>> On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>>>
>>>>>>> So in this situation, why is aurora not replacing the tasks instead
>>>>>>> of waiting for external recon to fix it?
>>>>>>>
>>>>>>> This is different from the case where the 75 sec (5*15) health check
>>>>>>> of the slave times out (no master failover); there, aurora replaces
>>>>>>> tasks on the task lost message.
>>>>>>>
>>>>>>> Are you hinting we should ask the mesos folks why, in the master
>>>>>>> failover reregistration timeout scenario, task lost is not sent even
>>>>>>> though slave lost is sent, when per the docs below task lost should
>>>>>>> have been sent?
>>>>>>>
>>>>>>> Because either mesos is not sending the right status or aurora is
>>>>>>> not handling it.
>>>>>>>
>>>>>>> Thx
>>>>>>>
>>>>>>>> On Jul 14, 2017, at 8:21 AM, David McLaughlin <dmclaughlin...@apache.org> wrote:
>>>>>>>>
>>>>>>>> "1. When mesos sends slave lost after 10 mins in this situation,
>>>>>>>> why does aurora not act on it?"
>>>>>>>>
>>>>>>>> Because Mesos also sends TASK_LOST for every task running on the
>>>>>>>> agent whenever it calls slaveLost:
>>>>>>>>
>>>>>>>> When it is time to remove an agent, the master removes the agent
>>>>>>>> from the list of registered agents in the master’s durable state
>>>>>>>> <http://mesos.apache.org/documentation/latest/replicated-log-internals/>
>>>>>>>> (this will survive master failover). The master sends a slaveLost
>>>>>>>> callback to every registered scheduler driver; it also sends
>>>>>>>> TASK_LOST status updates for every task that was running on the
>>>>>>>> removed agent.
>>>>>>>>
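>>>>>>>> To make that concrete, here is a rough, hypothetical sketch of the
>>>>>>>> framework side of that contract (not Aurora's actual code):
>>>>>>>> slaveLost carries no per-task information, so rescheduling is driven
>>>>>>>> by the per-task TASK_LOST updates. rescheduleTask() below is a
>>>>>>>> made-up placeholder.
>>>>>>>>
>>>>>>>>     import org.apache.mesos.Protos.SlaveID;
>>>>>>>>     import org.apache.mesos.Protos.TaskID;
>>>>>>>>     import org.apache.mesos.Protos.TaskState;
>>>>>>>>     import org.apache.mesos.Protos.TaskStatus;
>>>>>>>>     import org.apache.mesos.Scheduler;
>>>>>>>>     import org.apache.mesos.SchedulerDriver;
>>>>>>>>
>>>>>>>>     abstract class PartitionHandlingScheduler implements Scheduler {
>>>>>>>>       @Override
>>>>>>>>       public void slaveLost(SchedulerDriver driver, SlaveID slaveId) {
>>>>>>>>         // Advisory only: no task IDs arrive here, so we just log and
>>>>>>>>         // wait for the TASK_LOST updates the master is expected to send.
>>>>>>>>         System.out.println("Agent lost: " + slaveId.getValue());
>>>>>>>>       }
>>>>>>>>
>>>>>>>>       @Override
>>>>>>>>       public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
>>>>>>>>         if (status.getState() == TaskState.TASK_LOST) {
>>>>>>>>           // This per-task update is what actually triggers a replacement.
>>>>>>>>           // If the master only sends slaveLost (the post-failover path
>>>>>>>>           // discussed in this thread), nothing happens here until
>>>>>>>>           // reconciliation catches up.
>>>>>>>>           rescheduleTask(status.getTaskId());
>>>>>>>>         }
>>>>>>>>       }
>>>>>>>>
>>>>>>>>       abstract void rescheduleTask(TaskID taskId);
>>>>>>>>     }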
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> We were investigating slave re-registration behavior on master
>>>>>>>>> failover in Aurora 0.17 with mesos 1.1.
>>>>>>>>> A few important points:
>>>>>>>>> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
>>>>>>>>> (If an agent does not reregister with the new master within a
>>>>>>>>> timeout (controlled by the --agent_reregister_timeout configuration
>>>>>>>>> flag), the master marks the agent as failed and follows the same
>>>>>>>>> steps described above. However, there is one difference: by
>>>>>>>>> default, agents are allowed to reconnect following master failover,
>>>>>>>>> even after the agent_reregister_timeout has fired. This means that
>>>>>>>>> frameworks might see a TASK_LOST update for a task but then later
>>>>>>>>> discover that the task is running (because the agent where it was
>>>>>>>>> running was allowed to reconnect).
>>>>>>>>>
>>>>>>>>> http://mesos.apache.org/documentation/latest/reconciliation/
>>>>>>>>> (Implicit reconciliation (passing an empty list) should also be
>>>>>>>>> used periodically, as a defense against data loss in the framework.
>>>>>>>>> Unless a strict registry is in use on the master, it's possible for
>>>>>>>>> tasks to resurrect from a LOST state (without a strict registry the
>>>>>>>>> master does not enforce agent removal across failovers). When an
>>>>>>>>> unknown task is encountered, the scheduler should kill or recover
>>>>>>>>> the task.)
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/MESOS-5951 (Removes the
>>>>>>>>> strict registry mode flag in 1.1 and reverts to the old behavior of
>>>>>>>>> non-strict registry mode, where tasks and executors are not killed
>>>>>>>>> on agent reregistration timeout after master failover)
>>>>>>>>>
>>>>>>>>> So, what we find if the slave does not come back after 10 mins:
>>>>>>>>> 1. Mesos master sends slave lost but not task lost to Aurora.
>>>>>>>>> 2. Aurora does not replace the tasks.
>>>>>>>>> 3. Only when explicit recon starts does this get corrected, with
>>>>>>>>> aurora spawning replacement tasks.
>>>>>>>>>
>>>>>>>>> If the slave restarts after 10 mins:
>>>>>>>>> 1. When implicit recon starts, this situation gets fixed, because
>>>>>>>>> in aurora the tasks are marked as lost, mesos sends running, and
>>>>>>>>> those get killed and replaced.
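>>>>>>>>>
>>>>>>>>> (For illustration, a rough sketch of the "kill unknown/resurrected
>>>>>>>>> tasks" reaction the reconciliation doc above describes; this is not
>>>>>>>>> Aurora's actual code, and isTaskKnown() is a made-up placeholder
>>>>>>>>> for whatever task store the framework keeps.)
>>>>>>>>>
>>>>>>>>>     import org.apache.mesos.Protos.TaskID;
>>>>>>>>>     import org.apache.mesos.Protos.TaskState;
>>>>>>>>>     import org.apache.mesos.Protos.TaskStatus;
>>>>>>>>>     import org.apache.mesos.SchedulerDriver;
>>>>>>>>>
>>>>>>>>>     class ResurrectedTaskHandler {
>>>>>>>>>       /** Intended to be called from the scheduler's statusUpdate() callback. */
>>>>>>>>>       void handle(SchedulerDriver driver, TaskStatus status) {
>>>>>>>>>         boolean active = status.getState() == TaskState.TASK_RUNNING;
>>>>>>>>>         if (active && !isTaskKnown(status.getTaskId())) {
>>>>>>>>>           // Implicit reconciliation surfaced a task we already gave up
>>>>>>>>>           // on (e.g. marked LOST and replaced), so kill the extra copy.
>>>>>>>>>           driver.killTask(status.getTaskId());
>>>>>>>>>         }
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>>       private boolean isTaskKnown(TaskID taskId) {
>>>>>>>>>         return false;  // placeholder: consult the framework's own task store
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>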
>>>>>>>>> So, questions:
>>>>>>>>> 1. When mesos sends slave lost after 10 mins in this situation,
>>>>>>>>> why does aurora not act on it?
>>>>>>>>> 2. As per recon docs best practices, explicit recon should start,
>>>>>>>>> followed by implicit recon, on master failover. It looks like
>>>>>>>>> aurora is not doing that and the regular hourly recons are running
>>>>>>>>> with a 30 min spread between explicit and implicit. Should aurora
>>>>>>>>> do recon on master failover?
>>>>>>>>>
>>>>>>>>> General questions:
>>>>>>>>> 1. What is the effect on aurora if we make explicit recon every 15
>>>>>>>>> mins instead of the default 1 hr? Does it slow down scheduling,
>>>>>>>>> does snapshot creation get delayed, etc?
>>>>>>>>> 2. Any issue if the spread between explicit recon and implicit
>>>>>>>>> recon is brought down to 2 mins from 30 mins? Probably depends on 1.
>>>>>>>>> Thx
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>
>>