Based on the latest thread on the Mesos list, either we have to run reconciliation for the agent-removed scenario, or, I am guessing, Aurora keeps a mapping of tasks to agents and can force a LOST on agent removal without reconciliation (a rough sketch of that second option is below).

Thx
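To make that second option concrete, here is a minimal, purely illustrative sketch of a scheduler-side task-to-agent index in plain Java. The class name, agent ID, and task IDs are invented, and this is not Aurora's actual task store or its real slaveLost handling.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: an index from agent ID to the task IDs believed to be
// running on it, so that a slaveLost / agent-removed event can be turned into
// immediate LOST handling instead of waiting for the next reconciliation round.
public final class AgentTaskIndex {
  private final Map<String, Set<String>> tasksByAgent = new ConcurrentHashMap<>();

  // Record that a task was launched on (or reported RUNNING from) an agent.
  public void recordTask(String agentId, String taskId) {
    tasksByAgent.computeIfAbsent(agentId, k -> ConcurrentHashMap.newKeySet()).add(taskId);
  }

  // Forget a task once it reaches a terminal state.
  public void removeTask(String agentId, String taskId) {
    Set<String> tasks = tasksByAgent.get(agentId);
    if (tasks != null) {
      tasks.remove(taskId);
    }
  }

  // Called when the agent is removed: return every task that was believed to be
  // on it so the caller can mark them LOST and schedule replacements.
  public Set<String> drainAgent(String agentId) {
    Set<String> tasks = tasksByAgent.remove(agentId);
    return tasks == null ? Set.of() : tasks;
  }

  public static void main(String[] args) {
    AgentTaskIndex index = new AgentTaskIndex();
    index.recordTask("agent-1", "www-data-prod-hello-0");
    index.recordTask("agent-1", "www-data-prod-hello-1");
    // slaveLost("agent-1") arrives from the master:
    for (String taskId : index.drainAgent("agent-1")) {
      System.out.println("Would force TASK_LOST handling for " + taskId);
    }
  }
}

The point is simply that if the scheduler already knows which tasks were on the removed agent, it can treat them as LOST as soon as the agent-removed signal arrives, rather than waiting for the next explicit reconciliation round.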
> On Jul 16, 2017, at 11:10 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.INVALID> wrote:
>
> Got it. Thx!
>
>> On Jul 16, 2017, at 9:49 AM, Stephan Erb <m...@stephanerb.eu> wrote:
>>
>> Reconciliation in Aurora is not a specific mode. It just runs concurrently with other background work such as snapshots or backups [1].
>>
>> Just be aware that we don't have metrics to track the runtime of explicit and implicit reconciliations. If you use settings that are overly aggressive, you might overload Aurora's queue of incoming Mesos status updates (for example).
>>
>> [1] https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java
>>
>>> On Sat, 2017-07-15 at 22:28 -0700, Meghdoot bhattacharya wrote:
>>>
>>> Thx David for the follow-up and confirmation.
>>> We have started the thread on the mesos dev DL.
>>>
>>> To get clarification on reconciliation: what, in general, is its effect while it runs? Are scheduling and activities like snapshots paused while reconciliation takes place? Trying to decide whether to run aggressive reconciliation in the meantime.
>>>
>>> Thx
>>>
>>>> On Jul 15, 2017, at 9:33 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>>>>
>>>> I've left a comment on the initial RB detailing how the change broke backwards compatibility. Given that the tasks are marked as lost as soon as the agent reregisters after slaveLost is sent anyway, there doesn't seem to be any reason not to send TASK_LOST too. I think this should be an easy fix.
>>>>
>>>>> On Sat, Jul 15, 2017 at 9:21 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>>>>>
>>>>> Yes, we've confirmed this internally too (Santhosh did the work here):
>>>>>
>>>>>> When an agent becomes unreachable while the master is running, it sends TASK_LOST events for each task on the agent. https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L7066-L7107
>>>>>> Marking an agent unreachable after failover does not cause TASK_LOST events. https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L2036-L2070
>>>>>> Once an agent re-registers, it sends TASK_LOST events: the agent sends TASK_LOST for tasks that it does not know about after a master failover. https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/slave/slave.cpp#L1324-L1383
>>>>>
>>>>> The separate code path for markUnreachableAfterFailover appears to have been added by this commit: https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c
>>>>>
>>>>> And I think this totally breaks the promise of introducing the PARTITION_AWARE stuff in a backwards-compatible way.
>>>>>
>>>>> So right now, yes, we rely on reconciliation to finally mark the tasks as LOST and reschedule their replacements.
>>>>>
>>>>> I think the only reason we haven't been more impacted by this at Twitter is that our Mesos master is remarkably stable (compared to Aurora's daily failovers).
>>>>>
>>>>> We have two paths forward here: push forward and embrace the new partition-awareness features in Aurora, and/or push back on the above change with the Mesos community and have a better story for non-partition-aware APIs in the short term.
>>>>>
>>>>>> On Sat, Jul 15, 2017 at 2:01 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>>>
>>>>>> We can reproduce it easily; the steps are:
>>>>>> 1. Shut down the leading Mesos master.
>>>>>> 2. Shut down an agent at the same time.
>>>>>> 3. Wait for 10 mins.
>>>>>>
>>>>>> What Renan and I saw in the logs was only agent lost being sent, not task lost. In the regular health-check expiry scenario, both task lost and agent lost were sent.
>>>>>>
>>>>>> So yes, this is very concerning.
>>>>>>
>>>>>> Thx
>>>>>>
>>>>>>> On Jul 14, 2017, at 10:28 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>>>>>>>
>>>>>>> It would be interesting to see the logs. I think they will tell you whether the Mesos master is:
>>>>>>> a) Sending slaveLost
>>>>>>> b) Trying to send TASK_LOST
>>>>>>>
>>>>>>> And then the Scheduler logs (and/or the metrics it exports) should tell you whether those events were received. If this is reproducible, I'd consider it a serious bug.
>>>>>>>
>>>>>>>> On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>>>>>
>>>>>>>> So in this situation, why is Aurora not replacing the tasks instead of waiting for external reconciliation to fix it?
>>>>>>>>
>>>>>>>> This is different from the case where the 75 sec (5*15) health check of the slave times out with no master failover: there, Aurora replaces the tasks on the task lost message.
>>>>>>>>
>>>>>>>> Are you hinting we should ask the Mesos folks why, in the master-failover reregistration-timeout scenario, task lost is not sent even though slave lost is sent, when according to the docs below task lost should have been sent?
>>>>>>>>
>>>>>>>> Because either Mesos is not sending the right status or Aurora is not handling it.
>>>>>>>>
>>>>>>>> Thx
>>>>>>>>
>>>>>>>>> On Jul 14, 2017, at 8:21 AM, David McLaughlin <dmclaughlin...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> "1. When mesos sends slave lost after 10 mins in this situation, why does aurora not act on it?"
>>>>>>>>>
>>>>>>>>> Because Mesos also sends TASK_LOST for every task running on the agent whenever it calls slaveLost:
>>>>>>>>>
>>>>>>>>> "When it is time to remove an agent, the master removes the agent from the list of registered agents in the master's durable state <http://mesos.apache.org/documentation/latest/replicated-log-internals/> (this will survive master failover). The master sends a slaveLost callback to every registered scheduler driver; it also sends TASK_LOST status updates for every task that was running on the removed agent."
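For reference on the reconciliation the thread keeps falling back on, here is a hedged sketch of the two reconciliation flavours against the Mesos V0 Java bindings. The class and method names are made up for illustration; SchedulerDriver.reconcileTasks() is the real driver call.

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

// Illustration only: explicit vs. implicit reconciliation from a framework.
public final class ReconciliationCalls {

  // Explicit reconciliation: ask the master for the latest state of a concrete
  // set of task IDs. The master replies with one status update per task.
  static void explicitReconcile(SchedulerDriver driver, List<String> taskIds) {
    Collection<TaskStatus> statuses = new ArrayList<>();
    for (String id : taskIds) {
      statuses.add(
          TaskStatus.newBuilder()
              .setTaskId(TaskID.newBuilder().setValue(id))
              // Placeholder state; the master's reply carries the
              // authoritative state for the task.
              .setState(TaskState.TASK_STAGING)
              .build());
    }
    driver.reconcileTasks(statuses);
  }

  // Implicit reconciliation: an empty collection asks the master to send the
  // current status of every task it knows about for this framework, the
  // "defense against data loss" recommended by the reconciliation docs.
  static void implicitReconcile(SchedulerDriver driver) {
    driver.reconcileTasks(new ArrayList<TaskStatus>());
  }
}

Either way, the replies arrive as ordinary statusUpdate callbacks, which is why an overly aggressive schedule can flood the scheduler's status-update queue, as Stephan notes above.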
>>>>>>>>>
>>>>>>>>>> On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>> We were investigating slave re-registration behavior on master failover in Aurora 0.17 with Mesos 1.1.
>>>>>>>>>>
>>>>>>>>>> A few important points:
>>>>>>>>>>
>>>>>>>>>> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/ ("If an agent does not reregister with the new master within a timeout (controlled by the --agent_reregister_timeout configuration flag), the master marks the agent as failed and follows the same steps described above. However, there is one difference: by default, agents are allowed to reconnect following master failover, even after the agent_reregister_timeout has fired. This means that frameworks might see a TASK_LOST update for a task but then later discover that the task is running (because the agent where it was running was allowed to reconnect).")
>>>>>>>>>>
>>>>>>>>>> http://mesos.apache.org/documentation/latest/reconciliation/ ("Implicit reconciliation (passing an empty list) should also be used periodically, as a defense against data loss in the framework. Unless a strict registry is in use on the master, it's possible for tasks to resurrect from a LOST state (without a strict registry the master does not enforce agent removal across failovers). When an unknown task is encountered, the scheduler should kill or recover the task.")
>>>>>>>>>>
>>>>>>>>>> https://issues.apache.org/jira/browse/MESOS-5951 (removes the strict registry mode flag in 1.1 and reverts to the old, non-strict registry behavior, where tasks and executors were not killed on agent re-registration timeout after master failover)
>>>>>>>>>>
>>>>>>>>>> So, what we find if the slave does not come back within 10 mins:
>>>>>>>>>> 1. The Mesos master sends slave lost, but not task lost, to Aurora.
>>>>>>>>>> 2. Aurora does not replace the tasks.
>>>>>>>>>> 3. Only when explicit reconciliation starts does this get corrected, with Aurora spawning replacement tasks.
>>>>>>>>>>
>>>>>>>>>> If the slave restarts after 10 mins:
>>>>>>>>>> 1. When implicit reconciliation starts, the situation gets fixed: Aurora has marked the tasks as lost, Mesos reports them as running, and they get killed and replaced.
>>>>>>>>>>
>>>>>>>>>> So, questions:
>>>>>>>>>> 1. When Mesos sends slave lost after 10 mins in this situation, why does Aurora not act on it?
>>>>>>>>>> 2. As per the reconciliation docs' best practices, explicit recon should run, followed by implicit recon, on master failover. It looks like Aurora is not doing that; the regular hourly recons just keep running, with a 30 min spread between explicit and implicit.
>>>>>>>>>> Should Aurora do recon on master failover?
>>>>>>>>>>
>>>>>>>>>> General questions:
>>>>>>>>>> 1. What is the effect on Aurora if we run explicit recon every 15 mins instead of the default 1 hr? Does it slow down scheduling, does snapshot creation get delayed, etc.?
>>>>>>>>>> 2. Any issue if the spread between explicit and implicit recon is brought down to 2 mins from 30 mins? Probably depends on 1.
>>>>>>>>>>
>>>>>>>>>> Thx
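On the interval and spread questions, here is a toy model of the schedule under discussion, using only the JDK's ScheduledExecutorService. The numbers and Runnables are placeholders for illustration and are not Aurora's actual TaskReconciler implementation or defaults.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Toy model: explicit reconciliation on a fixed period, with implicit
// reconciliation running on the same period but offset by a "spread" so the
// two bursts of resulting status updates do not land at the same time.
public final class ReconciliationSchedule {
  public static void main(String[] args) {
    ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);

    long explicitIntervalMins = 15;  // e.g. the proposed 15 min instead of 1 hr
    long spreadMins = 2;             // e.g. the proposed 2 min instead of 30 min

    Runnable explicitRecon =
        () -> System.out.println("explicit reconciliation: query known task IDs");
    Runnable implicitRecon =
        () -> System.out.println("implicit reconciliation: query with empty list");

    // Explicit recon fires at t = 0, 15, 30, ... minutes.
    executor.scheduleAtFixedRate(explicitRecon, 0, explicitIntervalMins, TimeUnit.MINUTES);
    // Implicit recon fires at t = 2, 17, 32, ... minutes, i.e. the same period
    // shifted by the spread.
    executor.scheduleAtFixedRate(implicitRecon, spreadMins, explicitIntervalMins, TimeUnit.MINUTES);
  }
}

Whatever the chosen numbers, each round ultimately turns into a burst of Mesos status updates the scheduler has to process, which is the overload concern Stephan raises earlier in the thread.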