Re: Agent reregistration timeout, no TASK_LOST messages

David McLaughlin Mon, 17 Jul 2017 14:48:00 -0700

Sorry, I misread this. Thanks for the explanation, it makes sense now.

I guess reconciliation is the only way to handle this.


It would be good to update the docs to reflect the new/current behavior.
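
To make the situation concrete, here is a toy model of it. This is plain Python, not Mesos or Aurora code: ToyMaster, ToyScheduler and the "AGENT_LOST" label are names made up for illustration. It assumes, per the explanation quoted below, that only agent IDs survive a master failover and that task lists come back only from agents that re-register.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Toy stand-in for a failed-over master (hypothetical, not Mesos code)."""

    # Assumption: the only state that survives failover is the set of agent IDs.
    registry_agents: set
    # Task lists are rebuilt only from agents that re-register.
    tasks_by_agent: dict = field(default_factory=dict)

    def reregister(self, agent_id, tasks):
        """Record the tasks reported by a re-registering agent."""
        self.tasks_by_agent[agent_id] = list(tasks)

    def on_reregister_timeout(self):
        """Return the messages the toy master can emit when the timeout fires.

        For agents that never came back it has no task list, so it can only
        produce an agent-level message, never per-task updates.
        """
        missing = self.registry_agents - set(self.tasks_by_agent)
        return [("AGENT_LOST", agent_id) for agent_id in sorted(missing)]


@dataclass
class ToyScheduler:
    """Toy framework that keeps its own task -> agent bookkeeping."""

    tasks: dict  # task_id -> agent_id

    def on_agent_lost(self, agent_id):
        """Derive the affected tasks from the framework's own records."""
        return sorted(t for t, a in self.tasks.items() if a == agent_id)


if __name__ == "__main__":
    master = ToyMaster(registry_agents={"a1", "a2"})
    master.reregister("a1", ["t1"])  # a2 never re-registers
    scheduler = ToyScheduler(tasks={"t1": "a1", "t2": "a2", "t3": "a2"})
    for _, agent_id in master.on_reregister_timeout():
        print(agent_id, scheduler.on_agent_lost(agent_id))
```

In this model the framework has to work out the lost tasks itself from the agent-level message, or ask about them through reconciliation.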

On 2017-07-17 09:20 (-0700), Ilya Pronin <[email protected]> wrote:
> Hi,
>
> AFAIK the absence of TASK_LOST statuses is expected. Master registry
> persists information only about agents. Tasks are recovered from
> re-registering agents. Because of that the failed over master can't send
> TASK_LOST for tasks that were running on the agent that didn't re-register,
> it simply doesn't know about them. The only thing the master can do in this
> situation is send LostSlaveMessage that will tell the scheduler that tasks
> on this agent are LOST/UNREACHABLE.
>
> The situation where the agent came back after reregistration timeout
> doesn't sound good. The only way for the framework to learn about tasks
> that are still running on such agent is either from status updates or via
> implicit reconciliation. Perhaps, the master could send updates for tasks
> it learned about when such agent is readmitted?
>
> On Sun, Jul 16, 2017 at 5:54 AM, Meghdoot bhattacharya <[email protected]> wrote:
>
> > This looks like a serious bug unless we are missing something. Hoping for
> > clarifications.
> >
> > Thx
> >
> > > On Jul 14, 2017, at 3:52 PM, Renan DelValle <[email protected]> wrote:
> > >>
> > > Hi all,>
> > >>
> > > We're using Mesos 1.1.0 and have observed some unexpected behavior
with>
> > > regards to Agent reregistration on our cluster.>
> > >>
> > > When a health check failure happens, our framework (in this case
Apache>
> > > Aurora) receives an Agent Lost message along with TASK_LOST messages
for>
> > > each of the tasks that was currently running on the agent that failed
the>
> > > health check (not responding after *max_agent_ping_timeouts*).>
> > >>
> > > We expected the same behavior to take place when an Agent does not>
> > register>
> > > before the *agent_reregister_timeout* is up. However, while our
framework>
> > > did receive an Agent Lost message after 10 minutes had passed
(default>
> > > agent_reregister_timeout value) since leader election, it did not
receive>
> > > any messages concerning the tasks that were running on that node.>
> > >>
> > > This can create a scenario where, if the Agent goes away permanently,
we>
> > > have tasks that are unaccounted for and won't be restarted on
another>
> > Agent>
> > > until an explicit reconciliation is done.>
> > >>
> > > On the other hand, if the Agent does come back after the reregister>
> > > timeout, and the framework has replaced the missing instances, the>
> > > instances that were previously running will continue to run until an>
> > > implicit reconciliation is done.>
> > >>
> > > I understand some behavior may have changed with partition aware>
> > > frameworks, so I'm trying to understand if this is the expected
behavior.>
> > >>
> > > > For what it's worth, Aurora is not a partition aware framework.
> > >>
> > > Any help would be appreciated,>
> > >>
> > > Thanks!>
> > > -Renan>
> >>
> >>
> -- >
> Ilya Pronin>
>
