Got it. Thx!
> On Jul 16, 2017, at 9:49 AM, Stephan Erb <m...@stephanerb.eu> wrote:
>
> Reconciliation in Aurora is not a specific mode. It just runs
> concurrently with other background work such as snapshots or backups [1].
>
>
> Just be aware that we don't have metrics to track the runtime of
> explicit and implicit reconciliations. If you use settings that are
> overly aggressive, you might overload Aurora's queue of incoming Mesos
> status updates (for example).
>
> [1] https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java
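>
> For illustration only, here is a rough sketch of how the two reconciliation
> flavours map onto the Mesos scheduler driver. This is not Aurora's actual
> implementation (that lives in the TaskReconciler linked in [1]); the class
> and scheduling code are made up, and the hourly cadence with a 30 minute
> spread just mirrors the defaults mentioned further down this thread:
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.ScheduledExecutorService;
>     import java.util.concurrent.TimeUnit;
>
>     import org.apache.mesos.Protos.TaskStatus;
>     import org.apache.mesos.SchedulerDriver;
>
>     /** Hypothetical sketch of periodic explicit + implicit reconciliation. */
>     class ReconciliationSketch {
>       private final ScheduledExecutorService executor =
>           Executors.newSingleThreadScheduledExecutor();
>
>       void start(SchedulerDriver driver, List<TaskStatus> knownNonTerminalTasks) {
>         // Explicit: ask the master specifically about every task we believe
>         // is still active; it answers with the latest state for each.
>         executor.scheduleAtFixedRate(
>             () -> driver.reconcileTasks(new ArrayList<>(knownNonTerminalTasks)),
>             0, 60, TimeUnit.MINUTES);
>
>         // Implicit: an empty list asks the master for everything it knows,
>         // offset from the explicit run so the two don't flood the incoming
>         // status update queue at the same time.
>         executor.scheduleAtFixedRate(
>             () -> driver.reconcileTasks(new ArrayList<TaskStatus>()),
>             30, 60, TimeUnit.MINUTES);
>       }
>     }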
>
>
>> On Sat, 2017-07-15 at 22:28 -0700, Meghdoot bhattacharya wrote:
>> Thx David for the follow up and confirmation.
>> We have started the thread on the mesos dev DL.
>>
>> So to clarify what happens during recon: are scheduling and activities
>> like snapshots paused while recon takes place? Trying to decide whether
>> to run aggressive recon in the meantime.
>>
>> Thx
>>
>>> On Jul 15, 2017, at 9:33 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>>>
>>> I've left a comment on the initial RB detailing how the change broke
>>> backwards-compatibility. Given that the tasks are marked as lost as soon
>>> as the agent reregisters after slaveLost is sent anyway, there doesn't
>>> seem to be any reason not to send TASK_LOST too. I think this should be
>>> an easy fix.
>>>
>>> On Sat, Jul 15, 2017 at 9:21 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>>>
>>>> Yes, we've confirmed this internally too (Santhosh did the work here):
>>>>
>>>>> When an agent becomes unreachable while the master is running, the
>>>>> master sends TASK_LOST events for each task on the agent.
>>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L7066-L7107
>>>>> Marking an agent unreachable after failover does not cause TASK_LOST
>>>>> events.
>>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L2036-L2070
>>>>> Once an agent re-registers after a master failover, it sends TASK_LOST
>>>>> events for tasks that it does not know about.
>>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/slave/slave.cpp#L1324-L1383
>>>>
>>>>
>>>>
>>>> The separate code path for markUnreachableAfterFailover appears to have
>>>> been added by this commit:
>>>> https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c
>>>>
>>>> And I think this totally breaks the promise of introducing the
>>>> PARTITION_AWARE stuff in a backwards-compatible way.
>>>>
>>>> So right now, yes, we rely on reconciliation to finally mark the tasks
>>>> as LOST and reschedule their replacements.
>>>>
>>>> I think the only reason we haven't been more impacted by this at
>>>> Twitter is that our Mesos master is remarkably stable (compared to
>>>> Aurora's daily failovers).
>>>>
>>>> We have two paths forward here: push forward and embrace the new
>>>> partition awareness features in Aurora, and/or push back on the above
>>>> change with the Mesos community and have a better story for
>>>> non-partition-aware APIs in the short term.
>>>>
>>>>
>>>>
>>>> On Sat, Jul 15, 2017 at 2:01 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>
>>>>> We can reproduce it easily; the steps are:
>>>>> 1. Shut down the leading mesos master
>>>>> 2. Shut down an agent at the same time
>>>>> 3. Wait for 10 mins
>>>>>
>>>>> What Renan and I saw in the logs was only agent lost sent, not task
>>>>> lost, while in the regular health check expiry scenario both task lost
>>>>> and agent lost were sent.
>>>>>
>>>>> So yes this is very concerning.
>>>>>
>>>>> Thx
>>>>>
>>>>>> On Jul 14, 2017, at 10:28 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>>>>>>
>>>>>> It would be interesting to see the logs. I think that will tell you
>>>>>> if the Mesos master is:
>>>>>>
>>>>>> a) Sending slaveLost
>>>>>> b) Trying to send TASK_LOST
>>>>>>
>>>>>> And then the Scheduler logs (and/or the metrics it exports) should
>>>>>> tell you whether those events were received. If this is reproducible,
>>>>>> I'd consider it a serious bug.
>>>>>>
>>>>>> On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>>>
>>>>>>> So in this situation, why is aurora not replacing the tasks instead
>>>>>>> of waiting for external recon to fix it?
>>>>>>>
>>>>>>> This is different from the case where the 75 sec (5*15) health check
>>>>>>> of the slave times out (no master failover); there, aurora replaces
>>>>>>> tasks on the task lost message.
>>>>>>>
>>>>>>> Are you hinting we should ask the mesos folks why, in the master
>>>>>>> failover reregistration timeout scenario, task lost is not sent even
>>>>>>> though slave lost is sent, when per the docs below task lost should
>>>>>>> have been sent?
>>>>>>>
>>>>>>> Because either mesos is not sending the right status or aurora is
>>>>>>> not handling it.
>>>>>>>
>>>>>>> Thx
>>>>>>>
>>>>>>>> On Jul 14, 2017, at 8:21 AM, David McLaughlin <dmclaughlin...@apache.org> wrote:
>>>>>>>>
>>>>>>>> "1. When mesos sends slave lost after 10 mins in this situation,
>>>>>>>> why does aurora not act on it?"
>>>>>>>>
>>>>>>>> Because Mesos also sends TASK_LOST for every task running on the
>>>>>>>> agent whenever it calls slaveLost:
>>>>>>>>
>>>>>>>> When it is time to remove an agent, the master removes the agent
>>>>>>>> from the list of registered agents in the master’s durable state
>>>>>>>> <http://mesos.apache.org/documentation/latest/replicated-log-internals/>
>>>>>>>> (this will survive master failover). The master sends a slaveLost
>>>>>>>> callback to every registered scheduler driver; it also sends
>>>>>>>> TASK_LOST status updates for every task that was running on the
>>>>>>>> removed agent.
>>>>>>>>
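>>>>>>>> To make that concrete, here is a rough, hypothetical sketch of the
>>>>>>>> framework side of that contract (not Aurora's actual code):
>>>>>>>> slaveLost carries no per-task information, so rescheduling is driven
>>>>>>>> by the per-task TASK_LOST updates. rescheduleTask() below is a
>>>>>>>> made-up placeholder.
>>>>>>>>
>>>>>>>>     import org.apache.mesos.Protos.SlaveID;
>>>>>>>>     import org.apache.mesos.Protos.TaskID;
>>>>>>>>     import org.apache.mesos.Protos.TaskState;
>>>>>>>>     import org.apache.mesos.Protos.TaskStatus;
>>>>>>>>     import org.apache.mesos.Scheduler;
>>>>>>>>     import org.apache.mesos.SchedulerDriver;
>>>>>>>>
>>>>>>>>     abstract class PartitionHandlingScheduler implements Scheduler {
>>>>>>>>       @Override
>>>>>>>>       public void slaveLost(SchedulerDriver driver, SlaveID slaveId) {
>>>>>>>>         // Advisory only: no task IDs arrive here, so we just log and
>>>>>>>>         // wait for the TASK_LOST updates the master is expected to send.
>>>>>>>>         System.out.println("Agent lost: " + slaveId.getValue());
>>>>>>>>       }
>>>>>>>>
>>>>>>>>       @Override
>>>>>>>>       public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
>>>>>>>>         if (status.getState() == TaskState.TASK_LOST) {
>>>>>>>>           // This per-task update is what actually triggers a replacement.
>>>>>>>>           // If the master only sends slaveLost (the post-failover path
>>>>>>>>           // discussed in this thread), nothing happens here until
>>>>>>>>           // reconciliation catches up.
>>>>>>>>           rescheduleTask(status.getTaskId());
>>>>>>>>         }
>>>>>>>>       }
>>>>>>>>
>>>>>>>>       abstract void rescheduleTask(TaskID taskId);
>>>>>>>>     }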
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> We were investigating slave re-registration behavior on master
>>>>>>>>> failover in Aurora 0.17 with mesos 1.1.
>>>>>>>>> A few important points:
>>>>>>>>> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
>>>>>>>>> (If an agent does not reregister with the new master within a
>>>>>>>>> timeout (controlled by the --agent_reregister_timeout configuration
>>>>>>>>> flag), the master marks the agent as failed and follows the same
>>>>>>>>> steps described above. However, there is one difference: by
>>>>>>>>> default, agents are allowed to reconnect following master failover,
>>>>>>>>> even after the agent_reregister_timeout has fired. This means that
>>>>>>>>> frameworks might see a TASK_LOST update for a task but then later
>>>>>>>>> discover that the task is running (because the agent where it was
>>>>>>>>> running was allowed to reconnect).
>>>>>>>>>
>>>>>>>>> http://mesos.apache.org/documentation/latest/reconciliation/
>>>>>>>>> (Implicit reconciliation (passing an empty list) should also be
>>>>>>>>> used periodically, as a defense against data loss in the framework.
>>>>>>>>> Unless a strict registry is in use on the master, it's possible for
>>>>>>>>> tasks to resurrect from a LOST state (without a strict registry the
>>>>>>>>> master does not enforce agent removal across failovers). When an
>>>>>>>>> unknown task is encountered, the scheduler should kill or recover
>>>>>>>>> the task.)
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/MESOS-5951 (Removes the
>>>>>>>>> strict registry mode flag in 1.1 and reverts to the old behavior of
>>>>>>>>> non-strict registry mode, where tasks and executors are not killed
>>>>>>>>> on agent reregistration timeout after master failover)
>>>>>>>>>
>>>>>>>>> So, what we find if the slave does not come back after 10 mins:
>>>>>>>>> 1. Mesos master sends slave lost but not task lost to Aurora.
>>>>>>>>> 2. Aurora does not replace the tasks.
>>>>>>>>> 3. Only when explicit recon starts does this get corrected, with
>>>>>>>>> aurora spawning replacement tasks.
>>>>>>>>>
>>>>>>>>> If the slave restarts after 10 mins:
>>>>>>>>> 1. When implicit recon starts, this situation gets fixed, because
>>>>>>>>> in aurora the tasks are marked as lost, mesos sends running, and
>>>>>>>>> those get killed and replaced.
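>>>>>>>>>
>>>>>>>>> (For illustration, a rough sketch of the "kill unknown/resurrected
>>>>>>>>> tasks" reaction the reconciliation doc above describes; this is not
>>>>>>>>> Aurora's actual code, and isTaskKnown() is a made-up placeholder
>>>>>>>>> for whatever task store the framework keeps.)
>>>>>>>>>
>>>>>>>>>     import org.apache.mesos.Protos.TaskID;
>>>>>>>>>     import org.apache.mesos.Protos.TaskState;
>>>>>>>>>     import org.apache.mesos.Protos.TaskStatus;
>>>>>>>>>     import org.apache.mesos.SchedulerDriver;
>>>>>>>>>
>>>>>>>>>     class ResurrectedTaskHandler {
>>>>>>>>>       /** Intended to be called from the scheduler's statusUpdate() callback. */
>>>>>>>>>       void handle(SchedulerDriver driver, TaskStatus status) {
>>>>>>>>>         boolean active = status.getState() == TaskState.TASK_RUNNING;
>>>>>>>>>         if (active && !isTaskKnown(status.getTaskId())) {
>>>>>>>>>           // Implicit reconciliation surfaced a task we already gave up
>>>>>>>>>           // on (e.g. marked LOST and replaced), so kill the extra copy.
>>>>>>>>>           driver.killTask(status.getTaskId());
>>>>>>>>>         }
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>>       private boolean isTaskKnown(TaskID taskId) {
>>>>>>>>>         return false;  // placeholder: consult the framework's own task store
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>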
>>>>>>>>> So, questions:
>>>>>>>>> 1. When mesos sends slave lost after 10 mins in this situation,
>>>>>>>>> why does aurora not act on it?
>>>>>>>>> 2. As per recon docs best practices, explicit recon should start,
>>>>>>>>> followed by implicit recon, on master failover. It looks like
>>>>>>>>> aurora is not doing that and the regular hourly recons are running
>>>>>>>>> with a 30 min spread between explicit and implicit. Should aurora
>>>>>>>>> do recon on master failover?
>>>>>>>>>
>>>>>>>>> General questions:
>>>>>>>>> 1. What is the effect on aurora if we make explicit recon every 15
>>>>>>>>> mins instead of the default 1 hr? Does it slow down scheduling,
>>>>>>>>> does snapshot creation get delayed, etc?
>>>>>>>>> 2. Any issue if the spread between explicit recon and implicit
>>>>>>>>> recon is brought down to 2 mins from 30 mins? Probably depends on 1.
>>>>>>>>> Thx
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>
>>