Sorry, I misread this. Thanks for the explanation, it makes sense now. I guess reconciliation is the only way to handle this.
It would be good to update the docs to reflect the new/current behavior.

On 2017-07-17 09:20 (-0700), Ilya Pronin <[email protected]> wrote:
> Hi,
>
> AFAIK the absence of TASK_LOST statuses is expected. Master registry
> persists information only about agents. Tasks are recovered from
> re-registering agents. Because of that the failed over master can't send
> TASK_LOST for tasks that were running on the agent that didn't re-register,
> it simply doesn't know about them. The only thing the master can do in this
> situation is send LostSlaveMessage that will tell the scheduler that tasks
> on this agent are LOST/UNREACHABLE.
>
> The situation where the agent came back after reregistration timeout
> doesn't sound good. The only way for the framework to learn about tasks
> that are still running on such agent is either from status updates or via
> implicit reconciliation. Perhaps, the master could send updates for tasks
> it learned about when such agent is readmitted?
>
> On Sun, Jul 16, 2017 at 5:54 AM, Meghdoot bhattacharya <[email protected]> wrote:
>
> > This looks like a serious bug unless we are missing something. Hoping for
> > clarifications.
> >
> > Thx
> >
> > > On Jul 14, 2017, at 3:52 PM, Renan DelValle <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > We're using Mesos 1.1.0 and have observed some unexpected behavior with
> > > regards to Agent reregistration on our cluster.
> > >
> > > When a health check failure happens, our framework (in this case Apache
> > > Aurora) receives an Agent Lost message along with TASK_LOST messages for
> > > each of the tasks that was currently running on the agent that failed the
> > > health check (not responding after *max_agent_ping_timeouts*).
> > >
> > > We expected the same behavior to take place when an Agent does not
> > > register before the *agent_reregister_timeout* is up.
> > > However, while our framework
> > > did receive an Agent Lost message after 10 minutes had passed (default
> > > agent_reregister_timeout value) since leader election, it did not receive
> > > any messages concerning the tasks that were running on that node.
> > >
> > > This can create a scenario where, if the Agent goes away permanently, we
> > > have tasks that are unaccounted for and won't be restarted on another
> > > Agent until an explicit reconciliation is done.
> > >
> > > On the other hand, if the Agent does come back after the reregister
> > > timeout, and the framework has replaced the missing instances, the
> > > instances that were previously running will continue to run until an
> > > implicit reconciliation is done.
> > >
> > > I understand some behavior may have changed with partition aware
> > > frameworks, so I'm trying to understand if this is the expected behavior.
> > >
> > > For what is worth, Aurora is not a partition aware framework.
> > >
> > > Any help would be appreciated,
> > >
> > > Thanks!
> > > -Renan
>
> --
> Ilya Pronin
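
The failover behavior described in the quoted thread can be pictured with a small toy model. This is plain Python, not Mesos code: `ToyMaster` and all of its methods are hypothetical names made up for illustration. It also assumes, as a simplification, that explicit reconciliation answers TASK_LOST for any task the master does not know about (the non-partition-aware case) and that implicit reconciliation simply reports every task the master currently knows.

```python
# Toy model of a failed-over master, following the explanation in the thread.
# Everything here is hypothetical illustration code, not the Mesos API.


class ToyMaster:
    def __init__(self, registry_agents):
        # The registry persists agent IDs only; no task information survives failover.
        self.recovered_agents = set(registry_agents)
        self.reregistered = set()
        # task_id -> agent_id, learned only from agents that re-register.
        self.tasks = {}

    def reregister(self, agent_id, task_ids):
        """An agent re-registers and reports the tasks it is running."""
        self.reregistered.add(agent_id)
        for task_id in task_ids:
            self.tasks[task_id] = agent_id

    def reregister_timeout(self):
        """Messages the master can send once the re-registration timeout expires.

        Only agent-level messages are possible: the master never learned
        which tasks ran on the agents that failed to come back.
        """
        missing = sorted(self.recovered_agents - self.reregistered)
        return [("LostSlaveMessage", agent_id) for agent_id in missing]

    def reconcile(self, task_ids):
        """Explicit reconciliation for the given tasks; implicit if the list is empty."""
        if not task_ids:
            # Implicit: report every task the master knows about.
            return {task_id: "TASK_RUNNING" for task_id in self.tasks}
        # Explicit: unknown tasks are reported as lost (simplifying assumption).
        return {
            task_id: "TASK_RUNNING" if task_id in self.tasks else "TASK_LOST"
            for task_id in task_ids
        }


if __name__ == "__main__":
    master = ToyMaster(["agent-1", "agent-2"])
    master.reregister("agent-1", ["task-a"])

    # agent-2 never came back: only an agent-level message, no TASK_LOST.
    print(master.reregister_timeout())

    # The framework learns about task-b only by asking explicitly.
    print(master.reconcile(["task-a", "task-b"]))

    # agent-2 returns late; its task shows up again via implicit reconciliation.
    master.reregister("agent-2", ["task-b"])
    print(master.reconcile([]))
```

In this model the framework only finds out about `task-b` when it reconciles, which matches the two scenarios Renan describes: explicit reconciliation when the agent is gone for good, implicit reconciliation when it comes back late.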
