Not sending TASK_LOST is a breaking change compared to previous behavior. From the docs here:
http://mesos.apache.org/documentation/latest/high-availability-framework-guide/

> When it is time to remove an agent, the master removes the agent from the
> list of registered agents in the master's durable state
> <http://mesos.apache.org/documentation/latest/replicated-log-internals/>
> (this will survive master failover). The master sends a slaveLost callback
> to every registered scheduler driver; it also sends TASK_LOST status
> updates for every task that was running on the removed agent.

And then from the section on agent reregistration:

> If an agent does not reregister with the new master within a timeout
> (controlled by the --agent_reregister_timeout configuration flag), *the
> master marks the agent as failed and follows the same steps described
> above*. However, there is one difference: by default, agents are *allowed
> to reconnect* following master failover, even after the
> agent_reregister_timeout has fired. This means that frameworks might see
> a TASK_LOST update for a task but then later discover that the task is
> running (because the agent where it was running was allowed to reconnect).

Clearly the idea was that frameworks would see TASK_LOST every time the agent is marked as lost. This behavior appears to have been broken by this commit:
https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c

Reconciliation is still required because message delivery is best-effort, but the fundamental difference is that frameworks now *rely* on reconciliation for basic operation.
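In the meantime, a framework can compensate by triggering explicit reconciliation itself whenever slaveLost() fires. Below is a minimal sketch against the old V0 Java scheduler API (org.apache.mesos); the helper name and the caller-supplied list of task IDs are hypothetical, not Aurora's actual code:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.mesos.Protos.SlaveID;
import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

// Hypothetical helper: called from Scheduler.slaveLost() with the task IDs
// the framework last saw running on that agent (tracked by the framework
// itself from offers and prior status updates).
public final class LostAgentReconciler {

  private LostAgentReconciler() {}

  public static void reconcileTasksOnLostAgent(
      SchedulerDriver driver, SlaveID lostAgent, Collection<TaskID> knownTasks) {

    List<TaskStatus> statuses = new ArrayList<>();
    for (TaskID taskId : knownTasks) {
      // For explicit reconciliation only task_id (and optionally slave_id)
      // matter; state is a required proto field, so fill in the last known
      // state (TASK_RUNNING here).
      statuses.add(TaskStatus.newBuilder()
          .setTaskId(taskId)
          .setSlaveId(lostAgent)
          .setState(TaskState.TASK_RUNNING)
          .build());
    }

    // The master replies asynchronously via Scheduler.statusUpdate() with its
    // authoritative view; for a non-partition-aware framework, tasks on agents
    // the master has removed come back as TASK_LOST. Requests for agents the
    // master is still waiting on after a failover may get no reply, so
    // reconciliation is typically retried on a timer.
    driver.reconcileTasks(statuses);
  }
}

Of course this only helps for agents the framework hears about via slaveLost; everything else still needs periodic implicit reconciliation.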
We have plans to eventually adopt partition-awareness into Aurora, but IMO this change in behavior was an oversight when trying to maintain backwards compatibility, and it can be (harmlessly) fixed in Mesos.

Cheers,
David

On 2017-07-17 09:20 (-0700), Ilya Pronin <i...@twopensource.com> wrote:
> Hi,
>
> AFAIK the absence of TASK_LOST statuses is expected. The master registry
> persists information only about agents. Tasks are recovered from
> re-registering agents. Because of that, the failed-over master can't send
> TASK_LOST for tasks that were running on an agent that didn't re-register;
> it simply doesn't know about them. The only thing the master can do in
> this situation is send a LostSlaveMessage that tells the scheduler that
> tasks on this agent are LOST/UNREACHABLE.
>
> The situation where the agent came back after the reregistration timeout
> doesn't sound good. The only way for the framework to learn about tasks
> that are still running on such an agent is either from status updates or
> via implicit reconciliation. Perhaps the master could send updates for
> tasks it learned about when such an agent is readmitted?
>
> On Sun, Jul 16, 2017 at 5:54 AM, Meghdoot bhattacharya
> <meghdoo...@yahoo.com.invalid> wrote:
>
> > This looks like a serious bug unless we are missing something.
> > Hoping for clarifications.
> >
> > Thx
> >
> > On Jul 14, 2017, at 3:52 PM, Renan DelValle <rd...@binghamton.edu> wrote:
> >
> > > Hi all,
> > >
> > > We're using Mesos 1.1.0 and have observed some unexpected behavior
> > > with regards to agent reregistration on our cluster.
> > >
> > > When a health check failure happens, our framework (in this case
> > > Apache Aurora) receives an agent-lost message along with TASK_LOST
> > > messages for each of the tasks that was currently running on the
> > > agent that failed the health check (not responding after
> > > *max_agent_ping_timeouts*).
> > >
> > > We expected the same behavior to take place when an agent does not
> > > reregister before the *agent_reregister_timeout* is up. However,
> > > while our framework did receive an agent-lost message after 10
> > > minutes had passed (the default agent_reregister_timeout value)
> > > since leader election, it did not receive any messages concerning
> > > the tasks that were running on that node.
> > >
> > > This can create a scenario where, if the agent goes away permanently,
> > > we have tasks that are unaccounted for and won't be restarted on
> > > another agent until an explicit reconciliation is done.
> > >
> > > On the other hand, if the agent does come back after the reregister
> > > timeout, and the framework has replaced the missing instances, the
> > > instances that were previously running will continue to run until an
> > > implicit reconciliation is done.
> > >
> > > I understand some behavior may have changed with partition-aware
> > > frameworks, so I'm trying to understand if this is the expected
> > > behavior.
> > >
> > > For what it's worth, Aurora is not a partition-aware framework.
> > >
> > > Any help would be appreciated,
> > >
> > > Thanks!
> > > -Renan
>
> --
> Ilya Pronin
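For completeness, the implicit reconciliation Renan and Ilya refer to is just reconcileTasks() with an empty collection; the master then sends back the latest state of every task it knows about for the framework. A minimal sketch, again against the V0 Java API (the helper class is hypothetical):

import java.util.Collections;

import org.apache.mesos.Protos.Status;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

// Hypothetical helper: implicit reconciliation asks the master for the
// current state of every task it knows about for this framework. Answers
// arrive asynchronously through Scheduler.statusUpdate(), so instances
// still running on a readmitted agent eventually show up as TASK_RUNNING.
public final class ImplicitReconciler {

  private ImplicitReconciler() {}

  public static Status requestAll(SchedulerDriver driver) {
    // An empty collection means "all tasks"; frameworks typically call this
    // periodically because the replies are best-effort.
    return driver.reconcileTasks(Collections.<TaskStatus>emptyList());
  }
}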