Yup, that looks like the way to go. Going to go ahead and file a ticket on JIRA for this so that we don't forget. Thanks for digging into this, David.
-Renan

On Mon, Jul 17, 2017 at 3:00 PM, David McLaughlin <da...@dmclaughlin.com> wrote:

> Based on the thread in the Mesos dev list, it looks like, because they don't persist task information, they don't have the task IDs to send when they detect the agent is lost during failover. So unless this is changed on the Mesos side, we need to act on the slaveLost message and mark all those tasks as LOST in Aurora.
>
> Or rely on reconciliation. To reconcile more often, you should keep in mind:
>
> 1) Implicit reconciliation sends one message to Mesos, and Mesos replies with N status updates immediately, where N = the number of running tasks. This process is usually quick (on the order of seconds) because the updates are mostly NOOPs. When you have a large number of running tasks (say 100k+), you may see some GC pressure due to the flood of status updates. If this operation overlaps with another particularly expensive operation (like a snapshot), it can cause a huge stop-the-world GC, but it does not otherwise interfere with any operation.
>
> 2) Explicit reconciliation is done in batches: Aurora batches up all running tasks and sends one batch at a time, staggered by some delay. The benefit is less GC pressure, but the drawback is that if you have a lot of running tasks (again, 100k+), it will take over 10 minutes to complete. So you have to make sure your reconciliation interval is aligned with this (you can always increase the batch size to make it finish faster).
>
> Cheers,
> David
>
> On Sun, Jul 16, 2017 at 11:10 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>
>> Got it. Thx!
>>
>> > On Jul 16, 2017, at 9:49 AM, Stephan Erb <m...@stephanerb.eu> wrote:
>> >
>> > Reconciliation in Aurora is not a specific mode. It just runs concurrently with other background work such as snapshots or backups [1].
>> >
>> > Just be aware that we don't have metrics to track the runtime of explicit and implicit reconciliations. If you use settings that are overly aggressive, you might overload Aurora's queue of incoming Mesos status updates (for example).
>> >
>> > [1] https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java
>> >
>> >> On Sat, 2017-07-15 at 22:28 -0700, Meghdoot bhattacharya wrote:
>> >> Thx David for the follow up and confirmation. We have started the thread on the mesos dev DL.
>> >>
>> >> To get clarification on the recon: what, in general, is affected during the recon? Are scheduling and activities like snapshots paused while recon takes place? Trying to see whether to run aggressive recon in the meantime.
>> >>
>> >> Thx
>> >>
>> >>> On Jul 15, 2017, at 9:33 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>> >>>
>> >>> I've left a comment on the initial RB detailing how the change broke backwards-compatibility. Given that the tasks are marked as lost as soon as the agent reregisters after slaveLost is sent anyway, there doesn't seem to be any reason not to send TASK_LOST too. I think this should be an easy fix.
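[As a back-of-the-envelope illustration of the explicit-reconciliation timing David describes at the top of this thread, here is a minimal sketch. The task count, batch size, and stagger delay are made-up illustrative numbers, not Aurora defaults or flags.]

    // Rough estimate of how long one explicit reconciliation pass takes:
    // batches are sent sequentially, staggered by a fixed delay.
    public final class ReconEstimate {
      public static void main(String[] args) {
        int runningTasks = 100_000;        // illustrative fleet size
        int batchSize = 1_000;             // illustrative batch size
        long delayBetweenBatchesSecs = 5;  // illustrative stagger between batches

        long batches = (runningTasks + batchSize - 1) / batchSize;  // ceiling division
        long totalSecs = batches * delayBetweenBatchesSecs;

        // With these assumed numbers: 100 batches * 5s = 500s (~8 minutes), so the
        // explicit reconciliation interval must stay comfortably above the time a
        // full pass takes, or batches from consecutive passes start to overlap.
        System.out.printf("%d batches, ~%d minutes%n", batches, totalSecs / 60);
      }
    }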
>> >>>
>> >>> On Sat, Jul 15, 2017 at 9:21 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>> >>>
>> >>>> Yes, we've confirmed this internally too (Santhosh did the work here):
>> >>>>
>> >>>>> When an agent becomes unreachable while the master is running, it sends TASK_LOST events for each task on the agent.
>> >>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L7066-L7107
>> >>>>>
>> >>>>> Marking an agent unreachable after failover does not cause TASK_LOST events.
>> >>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L2036-L2070
>> >>>>>
>> >>>>> Once an agent re-registers, it sends TASK_LOST events: the agent sends TASK_LOST for tasks that it does not know about after a master failover.
>> >>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/slave/slave.cpp#L1324-L1383
>> >>>>
>> >>>> The separate code path for markUnreachableAfterFailover appears to have been added by this commit:
>> >>>> https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c
>> >>>>
>> >>>> And I think this totally breaks the promise of introducing the PARTITION_AWARE stuff in a backwards-compatible way.
>> >>>>
>> >>>> So right now, yes, we rely on reconciliation to finally mark the tasks as LOST and reschedule their replacements.
>> >>>>
>> >>>> I think the only reason we haven't been more impacted by this at Twitter is that our Mesos master is remarkably stable (compared to Aurora's daily failovers).
>> >>>>
>> >>>> We have two paths forward here: push forward and embrace the new partition-awareness features in Aurora, and/or push back on the above change with the Mesos community and have a better story for non-partition-aware APIs in the short term.
>> >>>>
>> >>>> On Sat, Jul 15, 2017 at 2:01 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>> >>>>
>> >>>>> We can reproduce it easily; the steps are:
>> >>>>> 1. Shut down the leading mesos master
>> >>>>> 2. Shut down an agent at the same time
>> >>>>> 3. Wait for 10 mins
>> >>>>>
>> >>>>> What Renan and I saw in the logs was only agent lost sent, not task lost. In the regular health-check expiry scenario, both task lost and agent lost were sent.
>> >>>>>
>> >>>>> So yes, this is very concerning.
>> >>>>>
>> >>>>> Thx
>> >>>>>
>> >>>>>> On Jul 14, 2017, at 10:28 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>> >>>>>>
>> >>>>>> It would be interesting to see the logs. I think that will tell you if the Mesos master is:
>> >>>>>>
>> >>>>>> a) Sending slaveLost
>> >>>>>> b) Trying to send TASK_LOST
>> >>>>>>
>> >>>>>> And then the Scheduler logs (and/or the metrics it exports) should tell you whether those events were received. If this is reproducible, I'd consider it a serious bug.
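[A minimal sketch of the workaround discussed earlier in the thread (acting on the slaveLost callback and marking the affected tasks LOST instead of waiting for reconciliation), assuming the Mesos Java scheduler API. The TaskStore type and its methods here are hypothetical stand-ins, not Aurora's actual storage interfaces.]

    import java.util.Set;
    import org.apache.mesos.Protos;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;

    // Sketch: when the master reports an agent as lost, transition every task the
    // scheduler believes is on that agent to LOST rather than waiting for the next
    // reconciliation pass. TaskStore is a hypothetical stand-in for the scheduler's
    // own task storage.
    abstract class SlaveLostAwareScheduler implements Scheduler {
      private final TaskStore taskStore;  // hypothetical task storage

      SlaveLostAwareScheduler(TaskStore taskStore) {
        this.taskStore = taskStore;
      }

      @Override
      public void slaveLost(SchedulerDriver driver, Protos.SlaveID slaveId) {
        // Mesos may not send TASK_LOST for these tasks after a master failover,
        // so mark them lost locally and let normal rescheduling replace them.
        Set<String> taskIds = taskStore.fetchTaskIdsOnAgent(slaveId.getValue());
        for (String taskId : taskIds) {
          taskStore.markLost(taskId, "Agent " + slaveId.getValue() + " was lost");
        }
      }

      interface TaskStore {  // hypothetical
        Set<String> fetchTaskIdsOnAgent(String agentId);
        void markLost(String taskId, String reason);
      }
    }

[The trade-off is exactly the one the docs quoted below describe: an agent may be allowed to reconnect after failover, so a task marked LOST this way can later turn out to be running and would then need to be killed or recovered.]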
>> >>>>>>
>> >>>>>> On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>> >>>>>>
>> >>>>>>> So in this situation, why is Aurora not replacing the tasks instead of waiting for external recon to fix it?
>> >>>>>>>
>> >>>>>>> This is different from when the 75 sec (5*15) health check of the slave times out (no master failover); there, Aurora replaces it on the task lost message.
>> >>>>>>>
>> >>>>>>> Are you hinting that we should ask the mesos folks why, in the master-failover reregistration-timeout scenario, task lost is not sent even though slave lost is sent, when per the docs below task lost should have been sent?
>> >>>>>>>
>> >>>>>>> Because either mesos is not sending the right status or aurora is not handling it.
>> >>>>>>>
>> >>>>>>> Thx
>> >>>>>>>
>> >>>>>>>> On Jul 14, 2017, at 8:21 AM, David McLaughlin <dmclaughlin...@apache.org> wrote:
>> >>>>>>>>
>> >>>>>>>> "1. When mesos sends slave lost after 10 mins in this situation, why does aurora not act on it?"
>> >>>>>>>>
>> >>>>>>>> Because Mesos also sends TASK_LOST for every task running on the agent whenever it calls slaveLost:
>> >>>>>>>>
>> >>>>>>>> When it is time to remove an agent, the master removes the agent from the list of registered agents in the master’s durable state <http://mesos.apache.org/documentation/latest/replicated-log-internals/> (this will survive master failover). The master sends a slaveLost callback to every registered scheduler driver; it also sends TASK_LOST status updates for every task that was running on the removed agent.
>> >>>>>>>>
>> >>>>>>>> On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>> >>>>>>>>
>> >>>>>>>>> We were investigating slave re-registration behavior on master failover in Aurora 0.17 with mesos 1.1.
>> >>>>>>>>>
>> >>>>>>>>> A few important points:
>> >>>>>>>>>
>> >>>>>>>>> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/ (If an agent does not reregister with the new master within a timeout (controlled by the --agent_reregister_timeout configuration flag), the master marks the agent as failed and follows the same steps described above. However, there is one difference: by default, agents are allowed to reconnect following master failover, even after the agent_reregister_timeout has fired. This means that frameworks might see a TASK_LOST update for a task but then later discover that the task is running (because the agent where it was running was allowed to reconnect).)
>> >>>>>>>>> http://mesos.apache.org/documentation/latest/reconciliation/ (Implicit reconciliation (passing an empty list) should also be used periodically, as a defense against data loss in the framework. Unless a strict registry is in use on the master, it's possible for tasks to resurrect from a LOST state (without a strict registry the master does not enforce agent removal across failovers). When an unknown task is encountered, the scheduler should kill or recover the task.)
>> >>>>>>>>>
>> >>>>>>>>> https://issues.apache.org/jira/browse/MESOS-5951 (Removes the strict registry mode flag from 1.1 and reverts to the old behavior of non-strict registry mode, where tasks and executors were not killed on agent reregistration timeout on master failover.)
>> >>>>>>>>>
>> >>>>>>>>> So, what we find if the slave does not come back after 10 mins:
>> >>>>>>>>> 1. The Mesos master sends slave lost but not task lost to Aurora.
>> >>>>>>>>> 2. Aurora does not replace the tasks.
>> >>>>>>>>> 3. Only when explicit recon starts does this get corrected, with Aurora spawning replacement tasks.
>> >>>>>>>>>
>> >>>>>>>>> If the slave restarts after 10 mins:
>> >>>>>>>>> 1. When implicit recon starts, the situation gets fixed: in Aurora the tasks are marked as lost, mesos reports them as running, and those get killed and replaced.
>> >>>>>>>>>
>> >>>>>>>>> So, questions:
>> >>>>>>>>> 1. When mesos sends slave lost after 10 mins in this situation, why does Aurora not act on it?
>> >>>>>>>>> 2. As per the recon docs best practices, explicit recon should start, followed by implicit recon, on master failover. It looks like Aurora is not doing that; the regular hourly recons are running with a 30 min spread between explicit and implicit. Should Aurora do recon on master failover?
>> >>>>>>>>>
>> >>>>>>>>> General questions:
>> >>>>>>>>> 1. What is the effect on Aurora if we run explicit recon every 15 mins instead of the default 1 hr? Does it slow down scheduling, does snapshot creation get delayed, etc.?
>> >>>>>>>>> 2. Any issue if the spread between explicit recon and implicit recon is brought down to 2 mins from 30 mins? Probably depends on 1.
>> >>>>>>>>>
>> >>>>>>>>> Thx
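[For reference, a minimal sketch of the "explicit recon followed by implicit recon" sequence these questions refer to, using the Mesos SchedulerDriver API. How the known task IDs are fetched and how the two calls are spaced out is left out; Aurora staggers them by a configurable spread and batches the explicit half, as described earlier in the thread.]

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Collections;
    import org.apache.mesos.Protos;
    import org.apache.mesos.SchedulerDriver;

    // Sketch of explicit reconciliation (ask about every task we think is running)
    // followed by implicit reconciliation (ask the master about everything it knows).
    final class ReconcileOnFailover {
      static void reconcile(SchedulerDriver driver, Collection<String> knownTaskIds) {
        // Explicit reconciliation: one TaskStatus per task the scheduler believes exists.
        Collection<Protos.TaskStatus> statuses = new ArrayList<>();
        for (String taskId : knownTaskIds) {
          statuses.add(Protos.TaskStatus.newBuilder()
              .setTaskId(Protos.TaskID.newBuilder().setValue(taskId))
              .setState(Protos.TaskState.TASK_RUNNING)  // latest known state
              .build());
        }
        driver.reconcileTasks(statuses);

        // Implicit reconciliation: an empty list asks the master to report every task
        // it knows about, catching tasks the scheduler has lost track of.
        driver.reconcileTasks(Collections.<Protos.TaskStatus>emptyList());
      }
    }

[In practice the two calls are not back-to-back: the explicit pass is sent in staggered batches and the implicit pass is spread some interval after it, which is exactly the spread being discussed above.]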