Based on the latest thread on the Mesos list, either we have to run reconciliation for the agent-removed scenario, or, I am guessing, Aurora keeps a mapping of tasks to agents and can force a LOST on agent removal without reconciliation (a rough sketch of that second option is below).

Thx
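To make that second option concrete, here is a minimal, purely illustrative sketch of a scheduler-side task-to-agent index in plain Java. The class name, agent ID, and task IDs are invented, and this is not Aurora's actual task store or its real slaveLost handling.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: an index from agent ID to the task IDs believed to be
// running on it, so that a slaveLost / agent-removed event can be turned into
// immediate LOST handling instead of waiting for the next reconciliation round.
public final class AgentTaskIndex {
  private final Map<String, Set<String>> tasksByAgent = new ConcurrentHashMap<>();

  // Record that a task was launched on (or reported RUNNING from) an agent.
  public void recordTask(String agentId, String taskId) {
    tasksByAgent.computeIfAbsent(agentId, k -> ConcurrentHashMap.newKeySet()).add(taskId);
  }

  // Forget a task once it reaches a terminal state.
  public void removeTask(String agentId, String taskId) {
    Set<String> tasks = tasksByAgent.get(agentId);
    if (tasks != null) {
      tasks.remove(taskId);
    }
  }

  // Called when the agent is removed: return every task that was believed to be
  // on it so the caller can mark them LOST and schedule replacements.
  public Set<String> drainAgent(String agentId) {
    Set<String> tasks = tasksByAgent.remove(agentId);
    return tasks == null ? Set.of() : tasks;
  }

  public static void main(String[] args) {
    AgentTaskIndex index = new AgentTaskIndex();
    index.recordTask("agent-1", "www-data-prod-hello-0");
    index.recordTask("agent-1", "www-data-prod-hello-1");
    // slaveLost("agent-1") arrives from the master:
    for (String taskId : index.drainAgent("agent-1")) {
      System.out.println("Would force TASK_LOST handling for " + taskId);
    }
  }
}

The point is simply that if the scheduler already knows which tasks were on the removed agent, it can treat them as LOST as soon as the agent-removed signal arrives, rather than waiting for the next explicit reconciliation round.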
> On Jul 16, 2017, at 11:10 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.INVALID> wrote:
>
> Got it. Thx!
>
>> On Jul 16, 2017, at 9:49 AM, Stephan Erb <m...@stephanerb.eu> wrote:
>>
>> Reconciliation in Aurora is not a specific mode. It just runs concurrently with other background work such as snapshots or backups [1].
>>
>> Just be aware that we don't have metrics to track the runtime of explicit and implicit reconciliations. If you use settings that are overly aggressive, you might overload Aurora's queue of incoming Mesos status updates (for example).
>>
>> [1] https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java
>>
>>> On Sat, 2017-07-15 at 22:28 -0700, Meghdoot bhattacharya wrote:
>>>
>>> Thx David for the follow-up and confirmation.
>>> We have started the thread on the mesos dev DL.
>>>
>>> To get clarification on reconciliation: what, in general, is its effect while it runs? Are scheduling and activities like snapshots paused while reconciliation takes place? Trying to decide whether to run aggressive reconciliation in the meantime.
>>>
>>> Thx
>>>
>>>> On Jul 15, 2017, at 9:33 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>>>>
>>>> I've left a comment on the initial RB detailing how the change broke backwards compatibility. Given that the tasks are marked as lost as soon as the agent reregisters after slaveLost is sent anyway, there doesn't seem to be any reason not to send TASK_LOST too. I think this should be an easy fix.
>>>>
>>>>> On Sat, Jul 15, 2017 at 9:21 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>>>>>
>>>>> Yes, we've confirmed this internally too (Santhosh did the work here):
>>>>>
>>>>>> When an agent becomes unreachable while the master is running, it sends TASK_LOST events for each task on the agent. https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L7066-L7107
>>>>>> Marking an agent unreachable after failover does not cause TASK_LOST events. https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L2036-L2070
>>>>>> Once an agent re-registers, it sends TASK_LOST events: the agent sends TASK_LOST for tasks that it does not know about after a master failover. https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/slave/slave.cpp#L1324-L1383
>>>>>
>>>>> The separate code path for markUnreachableAfterFailover appears to have been added by this commit: https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c
>>>>>
>>>>> And I think this totally breaks the promise of introducing the PARTITION_AWARE stuff in a backwards-compatible way.
>>>>>
>>>>> So right now, yes, we rely on reconciliation to finally mark the tasks as LOST and reschedule their replacements.
>>>>>
>>>>> I think the only reason we haven't been more impacted by this at Twitter is that our Mesos master is remarkably stable (compared to Aurora's daily failovers).
>>>>>
>>>>> We have two paths forward here: push forward and embrace the new partition-awareness features in Aurora, and/or push back on the above change with the Mesos community and have a better story for non-partition-aware APIs in the short term.
>>>>>
>>>>>> On Sat, Jul 15, 2017 at 2:01 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>>>
>>>>>> We can reproduce it easily; the steps are:
>>>>>> 1. Shut down the leading Mesos master.
>>>>>> 2. Shut down an agent at the same time.
>>>>>> 3. Wait for 10 mins.
>>>>>>
>>>>>> What Renan and I saw in the logs was only agent lost being sent, not task lost. In the regular health-check expiry scenario, both task lost and agent lost were sent.
>>>>>>
>>>>>> So yes, this is very concerning.
>>>>>>
>>>>>> Thx
>>>>>>
>>>>>>> On Jul 14, 2017, at 10:28 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>>>>>>>
>>>>>>> It would be interesting to see the logs. I think they will tell you whether the Mesos master is:
>>>>>>> a) Sending slaveLost
>>>>>>> b) Trying to send TASK_LOST
>>>>>>>
>>>>>>> And then the Scheduler logs (and/or the metrics it exports) should tell you whether those events were received. If this is reproducible, I'd consider it a serious bug.
>>>>>>>
>>>>>>>> On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>>>>>
>>>>>>>> So in this situation, why is Aurora not replacing the tasks instead of waiting for external reconciliation to fix it?
>>>>>>>>
>>>>>>>> This is different from the case where the 75 sec (5*15) health check of the slave times out with no master failover: there, Aurora replaces the tasks on the task lost message.
>>>>>>>>
>>>>>>>> Are you hinting we should ask the Mesos folks why, in the master-failover reregistration-timeout scenario, task lost is not sent even though slave lost is sent, when according to the docs below task lost should have been sent?
>>>>>>>>
>>>>>>>> Because either Mesos is not sending the right status or Aurora is not handling it.
>>>>>>>>
>>>>>>>> Thx
>>>>>>>>
>>>>>>>>> On Jul 14, 2017, at 8:21 AM, David McLaughlin <dmclaughlin...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> "1. When mesos sends slave lost after 10 mins in this situation, why does aurora not act on it?"
>>>>>>>>>
>>>>>>>>> Because Mesos also sends TASK_LOST for every task running on the agent whenever it calls slaveLost:
>>>>>>>>>
>>>>>>>>> "When it is time to remove an agent, the master removes the agent from the list of registered agents in the master's durable state <http://mesos.apache.org/documentation/latest/replicated-log-internals/> (this will survive master failover). The master sends a slaveLost callback to every registered scheduler driver; it also sends TASK_LOST status updates for every task that was running on the removed agent."
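For reference on the reconciliation the thread keeps falling back on, here is a hedged sketch of the two reconciliation flavours against the Mesos V0 Java bindings. The class and method names are made up for illustration; SchedulerDriver.reconcileTasks() is the real driver call.

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

// Illustration only: explicit vs. implicit reconciliation from a framework.
public final class ReconciliationCalls {

  // Explicit reconciliation: ask the master for the latest state of a concrete
  // set of task IDs. The master replies with one status update per task.
  static void explicitReconcile(SchedulerDriver driver, List<String> taskIds) {
    Collection<TaskStatus> statuses = new ArrayList<>();
    for (String id : taskIds) {
      statuses.add(
          TaskStatus.newBuilder()
              .setTaskId(TaskID.newBuilder().setValue(id))
              // Placeholder state; the master's reply carries the
              // authoritative state for the task.
              .setState(TaskState.TASK_STAGING)
              .build());
    }
    driver.reconcileTasks(statuses);
  }

  // Implicit reconciliation: an empty collection asks the master to send the
  // current status of every task it knows about for this framework, the
  // "defense against data loss" recommended by the reconciliation docs.
  static void implicitReconcile(SchedulerDriver driver) {
    driver.reconcileTasks(new ArrayList<TaskStatus>());
  }
}

Either way, the replies arrive as ordinary statusUpdate callbacks, which is why an overly aggressive schedule can flood the scheduler's status-update queue, as Stephan notes above.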
>>>>>>>>>
>>>>>>>>>> On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>> We were investigating slave re-registration behavior on master failover in Aurora 0.17 with Mesos 1.1.
>>>>>>>>>>
>>>>>>>>>> A few important points:
>>>>>>>>>>
>>>>>>>>>> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/ ("If an agent does not reregister with the new master within a timeout (controlled by the --agent_reregister_timeout configuration flag), the master marks the agent as failed and follows the same steps described above. However, there is one difference: by default, agents are allowed to reconnect following master failover, even after the agent_reregister_timeout has fired. This means that frameworks might see a TASK_LOST update for a task but then later discover that the task is running (because the agent where it was running was allowed to reconnect).")
>>>>>>>>>>
>>>>>>>>>> http://mesos.apache.org/documentation/latest/reconciliation/ ("Implicit reconciliation (passing an empty list) should also be used periodically, as a defense against data loss in the framework. Unless a strict registry is in use on the master, it's possible for tasks to resurrect from a LOST state (without a strict registry the master does not enforce agent removal across failovers). When an unknown task is encountered, the scheduler should kill or recover the task.")
>>>>>>>>>>
>>>>>>>>>> https://issues.apache.org/jira/browse/MESOS-5951 (removes the strict registry mode flag in 1.1 and reverts to the old, non-strict registry behavior, where tasks and executors were not killed on agent re-registration timeout after master failover)
>>>>>>>>>>
>>>>>>>>>> So, what we find if the slave does not come back within 10 mins:
>>>>>>>>>> 1. The Mesos master sends slave lost, but not task lost, to Aurora.
>>>>>>>>>> 2. Aurora does not replace the tasks.
>>>>>>>>>> 3. Only when explicit reconciliation starts does this get corrected, with Aurora spawning replacement tasks.
>>>>>>>>>>
>>>>>>>>>> If the slave restarts after 10 mins:
>>>>>>>>>> 1. When implicit reconciliation starts, the situation gets fixed: Aurora has marked the tasks as lost, Mesos reports them as running, and they get killed and replaced.
>>>>>>>>>>
>>>>>>>>>> So, questions:
>>>>>>>>>> 1. When Mesos sends slave lost after 10 mins in this situation, why does Aurora not act on it?
>>>>>>>>>> 2. As per the reconciliation docs' best practices, explicit recon should run, followed by implicit recon, on master failover. It looks like Aurora is not doing that; the regular hourly recons just keep running, with a 30 min spread between explicit and implicit.
>>>>>>>>>> Should Aurora do recon on master failover?
>>>>>>>>>>
>>>>>>>>>> General questions:
>>>>>>>>>> 1. What is the effect on Aurora if we run explicit recon every 15 mins instead of the default 1 hr? Does it slow down scheduling, does snapshot creation get delayed, etc.?
>>>>>>>>>> 2. Any issue if the spread between explicit and implicit recon is brought down to 2 mins from 30 mins? Probably depends on 1.
>>>>>>>>>>
>>>>>>>>>> Thx
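On the interval and spread questions, here is a toy model of the schedule under discussion, using only the JDK's ScheduledExecutorService. The numbers and Runnables are placeholders for illustration and are not Aurora's actual TaskReconciler implementation or defaults.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Toy model: explicit reconciliation on a fixed period, with implicit
// reconciliation running on the same period but offset by a "spread" so the
// two bursts of resulting status updates do not land at the same time.
public final class ReconciliationSchedule {
  public static void main(String[] args) {
    ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);

    long explicitIntervalMins = 15;  // e.g. the proposed 15 min instead of 1 hr
    long spreadMins = 2;             // e.g. the proposed 2 min instead of 30 min

    Runnable explicitRecon =
        () -> System.out.println("explicit reconciliation: query known task IDs");
    Runnable implicitRecon =
        () -> System.out.println("implicit reconciliation: query with empty list");

    // Explicit recon fires at t = 0, 15, 30, ... minutes.
    executor.scheduleAtFixedRate(explicitRecon, 0, explicitIntervalMins, TimeUnit.MINUTES);
    // Implicit recon fires at t = 2, 17, 32, ... minutes, i.e. the same period
    // shifted by the spread.
    executor.scheduleAtFixedRate(implicitRecon, spreadMins, explicitIntervalMins, TimeUnit.MINUTES);
  }
}

Whatever the chosen numbers, each round ultimately turns into a burst of Mesos status updates the scheduler has to process, which is the overload concern Stephan raises earlier in the thread.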