It would be interesting to see the logs. I think that will tell you if the
Mesos master is:

a) Sending slaveLost
b) Trying to send TASK_LOST

And then the Scheduler logs (and/or the metrics it exports) should tell you
whether those events were received. If this is reproducible, I'd consider
it a serious bug.

On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <
meghdoo...@yahoo.com.invalid> wrote:

> So in this situation, why is Aurora not replacing the tasks, and instead
> waiting for external recon to fix it?
>
> This is different from when the 75 sec (5*15) health check of the slave
> times out (no master failover): there, Aurora replaces the tasks on the
> task lost message.
>
> Are you hinting we should ask the Mesos folks why, in the master failover
> reregistration timeout scenario, task lost is not sent even though slave
> lost is sent, when per the docs below task lost should have been sent?
>
> Because either Mesos is not sending the right status or Aurora is not
> handling it.
>
> Thx
>
>
> On Jul 14, 2017, at 8:21 AM, David McLaughlin <dmclaugh...@apache.org>
> wrote:
> >
> > "1. When mesos sends slave lost after 10 mins in this situation, why
> > does aurora not act on it?"
> >
> > Because Mesos also sends TASK_LOST for every task running on the agent
> > whenever it calls slaveLost:
> >
> > When it is time to remove an agent, the master removes the agent from
> > the list of registered agents in the master’s durable state
> > <http://mesos.apache.org/documentation/latest/replicated-log-internals/>
> > (this will survive master failover). The master sends a slaveLost
> > callback to every registered scheduler driver; it also sends TASK_LOST
> > status updates for every task that was running on the removed agent.
> >
> >
> > On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <
> > meghdoo...@yahoo.com.invalid> wrote:
> >
> >> We were investigating slave reregistration behavior on master failover
> >> in Aurora 0.17 with Mesos 1.1.
> >>
> >> Few important points:
> >>
> >> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
> >> (If an agent does not reregister with the new master within a timeout
> >> (controlled by the --agent_reregister_timeout configuration flag), the
> >> master marks the agent as failed and follows the same steps described
> >> above. However, there is one difference: by default, agents are allowed
> >> to reconnect following master failover, even after the
> >> agent_reregister_timeout has fired. This means that frameworks might
> >> see a TASK_LOST update for a task but then later discover that the task
> >> is running (because the agent where it was running was allowed to
> >> reconnect).)
> >>
> >> http://mesos.apache.org/documentation/latest/reconciliation/
> >> (Implicit reconciliation (passing an empty list) should also be used
> >> periodically, as a defense against data loss in the framework. Unless a
> >> strict registry is in use on the master, it's possible for tasks to
> >> resurrect from a LOST state (without a strict registry the master does
> >> not enforce agent removal across failovers). When an unknown task is
> >> encountered, the scheduler should kill or recover the task.)
> >>
> >> https://issues.apache.org/jira/browse/MESOS-5951
> >> (Removes strict registry mode flag from 1.1 and reverts to the old
> >> behavior of non-strict registry mode where tasks and executors were not
> >> killed on agent reregistration timeout on master failover)
> >>
> >> So, what we find, if the slave does not come back after 10 mins:
> >> 1. Mesos master sends slave lost but not task lost to Aurora.
> >> 2. Aurora does not replace the tasks.
> >> 3. Only when explicit recon starts does this get corrected, with Aurora
> >>    spawning replacement tasks.
> >>
> >> If the slave restarts after 10 mins:
> >> 1. When implicit recon starts, this situation gets fixed, because in
> >>    Aurora the tasks are marked as lost while Mesos reports them as
> >>    running, so those get killed and replaced.
> >>
> >> So, questions:
> >> 1. When Mesos sends slave lost after 10 mins in this situation, why does
> >>    Aurora not act on it?
> >> 2. As per the recon docs' best practices, explicit recon should start,
> >>    followed by implicit recon, on master failover. It looks like Aurora
> >>    is not doing that, and the regular hourly recons are running with a
> >>    30 min spread between explicit and implicit. Should Aurora do recon
> >>    on master failover?
> >>
> >> General questions:
> >> 1. What is the effect on Aurora if we run explicit recon every 15 mins
> >>    instead of the default 1 hr? Does it slow down scheduling, does
> >>    snapshot creation get delayed, etc.?
> >> 2. Any issue if the spread between explicit recon and implicit recon is
> >>    brought down to 2 mins from 30 mins? Probably depends on 1.
> >>
> >> Thx
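
For anyone trying to reason about the sequence reported above, here is a
toy model of it in Python. It is only a sketch: ToyScheduler and its methods
are hypothetical names, not Aurora or Mesos APIs, and it assumes that a
slaveLost callback alone does not trigger replacement, and that explicit
reconciliation answers TASK_LOST for a task on an agent the master no longer
knows about.

```python
from dataclasses import dataclass, field

RUNNING, LOST = "TASK_RUNNING", "TASK_LOST"


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for a framework scheduler (not Aurora's code)."""
    tasks: dict = field(default_factory=dict)        # task_id -> (agent_id, state)
    lost_agents: set = field(default_factory=set)    # agents seen in slaveLost
    replacements: list = field(default_factory=list) # task_ids queued for replacement

    def slave_lost(self, agent_id):
        # Assumption: the callback is only recorded; tasks are replaced
        # when a TASK_LOST status update arrives for them.
        self.lost_agents.add(agent_id)

    def status_update(self, task_id, state):
        agent_id, old_state = self.tasks[task_id]
        self.tasks[task_id] = (agent_id, state)
        if state == LOST and old_state != LOST:
            self.replacements.append(task_id)

    def explicit_reconcile(self, registered_agents):
        # Assumption: for each task the scheduler believes is running, the
        # master reports TASK_LOST if the task's agent is not registered.
        for task_id, (agent_id, state) in list(self.tasks.items()):
            if state == RUNNING and agent_id not in registered_agents:
                self.status_update(task_id, LOST)


sched = ToyScheduler(tasks={"t1": ("agent-1", RUNNING),
                            "t2": ("agent-2", RUNNING)})
sched.slave_lost("agent-1")            # slaveLost arrives, no TASK_LOST
print(sched.replacements)              # nothing replaced yet
sched.explicit_reconcile({"agent-2"})  # next explicit recon
print(sched.replacements)              # t1 is now queued for replacement
```

In this model the gap between the two print calls is the window described
in the thread: the agent is known to be lost, but its tasks stay RUNNING in
the scheduler until the next explicit reconciliation.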