Hi,

AFAIK the absence of TASK_LOST statuses is expected. The master registry
persists information only about agents; tasks are recovered from
re-registering agents. Because of that, the failed-over master can't send
TASK_LOST for tasks that were running on an agent that didn't re-register:
it simply doesn't know about them. The only thing the master can do in this
situation is send a LostSlaveMessage, which tells the scheduler that the
tasks on this agent are LOST/UNREACHABLE.
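
To illustrate, here is a toy Python model of what a failed-over master
knows. This is not Mesos code: the class, method, and message names are
made up for the sketch, and it assumes only what is described above (the
registry holds agent IDs, task state comes from re-registering agents).

```python
# Toy model of master failover; not Mesos code. All names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Agent:
    agent_id: str
    tasks: list = field(default_factory=list)  # task IDs known only to the agent


class FailedOverMaster:
    def __init__(self, registry_agent_ids):
        # The registry persists agent IDs only, no task information.
        self.expected = set(registry_agent_ids)
        self.tasks_by_agent = {}

    def reregister(self, agent):
        # Task state is rebuilt from what the re-registering agent reports.
        self.expected.discard(agent.agent_id)
        self.tasks_by_agent[agent.agent_id] = list(agent.tasks)

    def on_reregister_timeout(self):
        # For agents that never came back the master knows the ID but no
        # tasks, so it can emit an agent-lost message and nothing per task.
        messages = [("AGENT_LOST", agent_id) for agent_id in sorted(self.expected)]
        self.expected.clear()
        return messages


master = FailedOverMaster(["agent-1", "agent-2"])
master.reregister(Agent("agent-1", ["task-a"]))
messages = master.on_reregister_timeout()
print(messages)
```

In this sketch agent-2 never re-registers, so the only thing the master can
report is the agent itself; whatever tasks it was running never show up in
the master's state.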

The situation where the agent comes back after the re-registration timeout
doesn't sound good. The only way for the framework to learn about tasks
that are still running on such an agent is either from status updates or
via implicit reconciliation. Perhaps the master could send updates for the
tasks it learned about when such an agent is readmitted?
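
Until something like that exists, a framework could cover both cases with
its own bookkeeping. Below is a rough Python sketch of the idea. It is not
the Mesos scheduler API: TaskTracker and its methods are hypothetical, and
it assumes the framework records which agent each task was launched on.

```python
# Toy framework-side bookkeeping; hypothetical names, not the Mesos API.
class TaskTracker:
    def __init__(self):
        self.agent_of = {}  # task ID -> agent ID, recorded at launch
        self.state = {}     # task ID -> last known state

    def launched(self, task_id, agent_id):
        self.agent_of[task_id] = agent_id
        self.state[task_id] = "TASK_RUNNING"

    def agent_lost(self, agent_id):
        # No per-task updates arrive, so derive them from local bookkeeping.
        lost = sorted(
            t for t, a in self.agent_of.items()
            if a == agent_id and self.state[t] == "TASK_RUNNING"
        )
        for task_id in lost:
            self.state[task_id] = "TASK_LOST"
        return lost

    def reconciled(self, running_task_ids):
        # Tasks reported as running that were already given up on are
        # candidates to kill, since replacements may have been launched.
        return sorted(
            t for t in running_task_ids if self.state.get(t) == "TASK_LOST"
        )


tracker = TaskTracker()
tracker.launched("task-a", "agent-1")
tracker.launched("task-b", "agent-2")
lost = tracker.agent_lost("agent-2")                 # agent-lost message arrives
stale = tracker.reconciled(["task-a", "task-b"])     # agent-2 is readmitted later
print(lost, stale)
```

Here agent_lost stands in for handling the lost-agent message, and
reconciled stands in for processing the results of an implicit
reconciliation after the agent returns.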

On Sun, Jul 16, 2017 at 5:54 AM, Meghdoot bhattacharya <
meghdoo...@yahoo.com.invalid> wrote:

> This looks like a serious bug unless we are missing something. Hoping for
> clarifications.
>
> Thx
>
> > On Jul 14, 2017, at 3:52 PM, Renan DelValle <rdelv...@binghamton.edu> wrote:
> >
> > Hi all,
> >
> > We're using Mesos 1.1.0 and have observed some unexpected behavior with
> > regard to Agent reregistration on our cluster.
> >
> > When a health check failure happens, our framework (in this case Apache
> > Aurora) receives an Agent Lost message along with TASK_LOST messages for
> > each of the tasks that was currently running on the agent that failed the
> > health check (not responding after *max_agent_ping_timeouts*).
> >
> > We expected the same behavior to take place when an Agent does not register
> > before the *agent_reregister_timeout* is up. However, while our framework
> > did receive an Agent Lost message after 10 minutes had passed (default
> > agent_reregister_timeout value) since leader election, it did not receive
> > any messages concerning the tasks that were running on that node.
> >
> > This can create a scenario where, if the Agent goes away permanently, we
> > have tasks that are unaccounted for and won't be restarted on another Agent
> > until an explicit reconciliation is done.
> >
> > On the other hand, if the Agent does come back after the reregister
> > timeout, and the framework has replaced the missing instances, the
> > instances that were previously running will continue to run until an
> > implicit reconciliation is done.
> >
> > I understand some behavior may have changed with partition aware
> > frameworks, so I'm trying to understand if this is the expected behavior.
> >
> > For what it's worth, Aurora is not a partition aware framework.
> >
> > Any help would be appreciated,
> >
> > Thanks!
> > -Renan
>
>
-- 
Ilya Pronin
