I've left a comment on the initial RB detailing how the change broke backwards-compatibility. Given that the tasks are marked as lost as soon as the agent reregisters after slaveLost is sent anyway, there doesn't seem to be any reason not to send TASK_LOST too. I think this should be an easy fix.
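For reference, here is a rough sketch (not the actual Mesos master code path) of what "send TASK_LOST too" would mean in the markUnreachableAfterFailover path: build a TASK_LOST update for every task known to be on the agent and hand it to the owning framework, mirroring what the non-failover unreachability path already does. forwardToFramework() below is a hypothetical stand-in for the master's internal status-update forwarding; only the protobuf fields are the real ones.

    // Rough sketch only (not actual Mesos master code). forwardToFramework()
    // is a hypothetical stand-in for the master's status-update forwarding.
    #include <vector>

    #include <mesos/mesos.hpp>  // TaskInfo, TaskStatus, TaskState, SlaveID protos

    using namespace mesos;

    // Hypothetical: deliver a master-generated status update to the scheduler.
    void forwardToFramework(const TaskStatus& status);

    void sendTaskLostForUnreachableAgent(
        const SlaveID& slaveId,
        const std::vector<TaskInfo>& tasksOnAgent)
    {
      for (const TaskInfo& task : tasksOnAgent) {
        TaskStatus status;
        status.mutable_task_id()->CopyFrom(task.task_id());
        status.mutable_slave_id()->CopyFrom(slaveId);
        status.set_state(TASK_LOST);
        status.set_source(TaskStatus::SOURCE_MASTER);
        status.set_reason(TaskStatus::REASON_SLAVE_REMOVED);
        status.set_message(
            "Agent did not reregister within the reregistration timeout"
            " after master failover");

        forwardToFramework(status);
      }
    }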
On Sat, Jul 15, 2017 at 9:21 AM, David McLaughlin <dmclaugh...@apache.org> wrote:

> Yes, we've confirmed this internally too (Santhosh did the work here):
>
>> When an agent becomes unreachable while the master is running, it sends TASK_LOST events for each task on the agent. https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L7066-L7107
>>
>> Marking an agent unreachable after failover does not cause TASK_LOST events. https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L2036-L2070
>>
>> Once an agent re-registers, it sends TASK_LOST events: the agent sends TASK_LOST for tasks it does not know about after a master failover. https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/slave/slave.cpp#L1324-L1383
>
> The separate code path for markUnreachableAfterFailover appears to have been added by this commit: https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c
>
> And I think this totally breaks the promise of introducing the PARTITION_AWARE stuff in a backwards-compatible way.
>
> So right now, yes, we rely on reconciliation to finally mark the tasks as LOST and reschedule their replacements.
>
> I think the only reason we haven't been more impacted by this at Twitter is that our Mesos master is remarkably stable (compared to Aurora's daily failovers).
>
> We have two paths forward here: push forward and embrace the new partition-awareness features in Aurora, and/or push back on the above change with the Mesos community so there is a better story for non-partition-aware APIs in the short term.
>
> On Sat, Jul 15, 2017 at 2:01 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>
>> We can reproduce it easily; the steps are:
>>
>> 1. Shut down the leading Mesos master.
>> 2. Shut down an agent at the same time.
>> 3. Wait for 10 minutes.
>>
>> What Renan and I saw in the logs was that only agent lost was sent, not task lost. In the regular health-check-expiry scenario, both task lost and agent lost were sent.
>>
>> So yes, this is very concerning.
>>
>> Thx
>>
>> On Jul 14, 2017, at 10:28 AM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>
>>> It would be interesting to see the logs. I think that will tell you if the Mesos master is:
>>>
>>> a) Sending slaveLost
>>> b) Trying to send TASK_LOST
>>>
>>> And then the scheduler logs (and/or the metrics it exports) should tell you whether those events were received. If this is reproducible, I'd consider it a serious bug.
>>>
>>> On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>
>>>> So in this situation, why is Aurora not replacing the tasks instead of waiting for external recon to fix it?
>>>>
>>>> This is different from when the 75 sec (5*15) health check of the slave times out (no master failover); there Aurora replaces the tasks on the task lost message.
>>>>
>>>> Are you hinting we should ask the Mesos folks why, in the master failover re-registration timeout scenario, task lost is not sent even though slave lost is sent, when per the docs below task lost should have been sent?
>>>>
>>>> Because either Mesos is not sending the right status or Aurora is not handling it.
>>>>
>>>> Thx
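To make that "is it sent vs. is it handled" question answerable, a couple of counters bumped from the scheduler's callbacks is usually enough. A minimal diagnostic sketch (framework-agnostic, not Aurora's actual metrics code) against the standard C++ scheduler driver callbacks:

    // Count what the master actually delivers after a failover + agent loss.
    // Wire these into your Scheduler implementation's slaveLost() and
    // statusUpdate() callbacks; exporting the counters is omitted here.
    #include <atomic>
    #include <iostream>

    #include <mesos/mesos.hpp>  // SlaveID, TaskStatus protos

    static std::atomic<long> slaveLostReceived{0};
    static std::atomic<long> taskLostReceived{0};

    // Call from Scheduler::slaveLost(SchedulerDriver*, const SlaveID&).
    void recordSlaveLost(const mesos::SlaveID& slaveId)
    {
      ++slaveLostReceived;
      std::cerr << "slaveLost received for agent " << slaveId.value() << std::endl;
    }

    // Call from Scheduler::statusUpdate(SchedulerDriver*, const TaskStatus&).
    void recordStatusUpdate(const mesos::TaskStatus& status)
    {
      if (status.state() == mesos::TASK_LOST) {
        ++taskLostReceived;
        std::cerr << "TASK_LOST received for task " << status.task_id().value()
                  << std::endl;
      }
    }

If only the slaveLost counter moves after the reregistration timeout, the master is not sending TASK_LOST; if both move and Aurora still does not reschedule, the handling side is the problem.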
>>>> On Jul 14, 2017, at 8:21 AM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>>>
>>>>> "1. When mesos sends slave lost after 10 mins in this situation, why does aurora not act on it?"
>>>>>
>>>>> Because Mesos also sends TASK_LOST for every task running on the agent whenever it calls slaveLost:
>>>>>
>>>>> When it is time to remove an agent, the master removes the agent from the list of registered agents in the master’s durable state <http://mesos.apache.org/documentation/latest/replicated-log-internals/> (this will survive master failover). The master sends a slaveLost callback to every registered scheduler driver; it also sends TASK_LOST status updates for every task that was running on the removed agent.
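That per-task TASK_LOST is exactly what a non-partition-aware scheduler ends up relying on; when it is not delivered (the failover path above), the framework's only alternative is to act on slaveLost itself. A rough framework-side sketch of that (not Aurora's actual code; tasksByAgent and rescheduleTask are hypothetical stand-ins for the scheduler's own bookkeeping and rescheduling logic):

    // Acting on slaveLost directly: treat every task the framework believes
    // is on the lost agent as lost and reschedule it.
    #include <string>
    #include <unordered_map>
    #include <vector>

    #include <mesos/mesos.hpp>  // SlaveID, TaskID protos

    using TasksByAgent =
        std::unordered_map<std::string, std::vector<mesos::TaskID>>;

    void rescheduleTask(const mesos::TaskID& taskId);  // hypothetical

    // Call from Scheduler::slaveLost(SchedulerDriver*, const SlaveID&).
    void handleSlaveLost(const mesos::SlaveID& slaveId, TasksByAgent& tasksByAgent)
    {
      auto it = tasksByAgent.find(slaveId.value());
      if (it == tasksByAgent.end()) {
        return;  // no tasks believed to be on this agent
      }

      // Without per-task TASK_LOST updates, slaveLost is the only signal that
      // these tasks are gone, so mark them lost and replace them here.
      for (const mesos::TaskID& taskId : it->second) {
        rescheduleTask(taskId);
      }

      tasksByAgent.erase(it);
    }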
>>>>> On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <meghdoo...@yahoo.com.invalid> wrote:
>>>>>
>>>>>> We were investigating slave re-registration behavior on master failover in Aurora 0.17 with Mesos 1.1.
>>>>>>
>>>>>> A few important points:
>>>>>>
>>>>>> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/ ("If an agent does not reregister with the new master within a timeout (controlled by the --agent_reregister_timeout configuration flag), the master marks the agent as failed and follows the same steps described above. However, there is one difference: by default, agents are allowed to reconnect following master failover, even after the agent_reregister_timeout has fired. This means that frameworks might see a TASK_LOST update for a task but then later discover that the task is running (because the agent where it was running was allowed to reconnect).")
>>>>>>
>>>>>> http://mesos.apache.org/documentation/latest/reconciliation/ ("Implicit reconciliation (passing an empty list) should also be used periodically, as a defense against data loss in the framework. Unless a strict registry is in use on the master, it's possible for tasks to resurrect from a LOST state (without a strict registry the master does not enforce agent removal across failovers). When an unknown task is encountered, the scheduler should kill or recover the task.")
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/MESOS-5951 (Removes the strict registry mode flag from 1.1 and reverts to the old behavior of non-strict registry mode, where tasks and executors were not killed on agent re-registration timeout after master failover.)
>>>>>>
>>>>>> So, what we find if the slave does not come back after 10 mins:
>>>>>> 1. The Mesos master sends slave lost but not task lost to Aurora.
>>>>>> 2. Aurora does not replace the tasks.
>>>>>> 3. Only when explicit recon starts does this get corrected, with Aurora spawning replacement tasks.
>>>>>>
>>>>>> If the slave restarts after 10 mins:
>>>>>> 1. When implicit recon starts, the situation gets fixed: the tasks are marked as lost in Aurora, Mesos reports them as running, and they get killed and replaced.
>>>>>>
>>>>>> So, questions:
>>>>>> 1. When Mesos sends slave lost after 10 mins in this situation, why does Aurora not act on it?
>>>>>> 2. As per the recon docs' best practices, explicit recon followed by implicit recon should run on master failover. It looks like Aurora is not doing that; only the regular hourly recons run, with a 30 min spread between explicit and implicit. Should Aurora do recon on master failover?
>>>>>>
>>>>>> General questions:
>>>>>> 1. What is the effect on Aurora if we run explicit recon every 15 mins instead of the default 1 hr? Does it slow down scheduling, does snapshot creation get delayed, etc.?
>>>>>> 2. Any issue if the spread between explicit recon and implicit recon is brought down to 2 mins from 30 mins? Probably depends on 1.
>>>>>>
>>>>>> Thx
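On the reconciliation questions above: triggering an explicit pass right after a master failover (for example from the driver's reregistered callback), followed shortly by an implicit pass, comes down to two reconcileTasks calls. A minimal sketch with the C++ scheduler driver, assuming the framework can enumerate the task/agent IDs it believes are live (liveTasks() is a hypothetical stand-in for that); the 15 min interval and 2 min spread would come from whatever timer infrastructure the scheduler already has:

    // Explicit: ask the master about the tasks we think are running.
    // Implicit: pass an empty list so the master streams back the latest
    // state for all tasks it knows for this framework.
    #include <utility>
    #include <vector>

    #include <mesos/mesos.hpp>
    #include <mesos/scheduler.hpp>  // SchedulerDriver::reconcileTasks()

    // Hypothetical: task/agent IDs the framework currently believes are live.
    std::vector<std::pair<mesos::TaskID, mesos::SlaveID>> liveTasks();

    void explicitReconcile(mesos::SchedulerDriver* driver)
    {
      std::vector<mesos::TaskStatus> statuses;
      for (const auto& entry : liveTasks()) {
        mesos::TaskStatus status;
        status.mutable_task_id()->CopyFrom(entry.first);
        status.mutable_slave_id()->CopyFrom(entry.second);
        status.set_state(mesos::TASK_RUNNING);  // required proto field; the
                                                // master reconciles on the IDs
        statuses.push_back(status);
      }
      driver->reconcileTasks(statuses);
    }

    void implicitReconcile(mesos::SchedulerDriver* driver)
    {
      // Empty list => implicit reconciliation.
      driver->reconcileTasks(std::vector<mesos::TaskStatus>());
    }

Running the explicit pass on failover and every 15 minutes, with the implicit pass 2 minutes later, is then just a matter of where these two calls are hooked into the scheduler's timers.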