Not sending TASK_LOST is a breaking change compared to previous behavior. From the docs here:
http://mesos.apache.org/documentation/latest/high-availability-framework-guide/

> When it is time to remove an agent, the master removes the agent from the
> list of registered agents in the master's durable state
> <http://mesos.apache.org/documentation/latest/replicated-log-internals/>
> (this will survive master failover). The master sends a slaveLost callback
> to every registered scheduler driver; it also sends TASK_LOST status
> updates for every task that was running on the removed agent.

And then from the section on agent reregistration:

> If an agent does not reregister with the new master within a timeout
> (controlled by the --agent_reregister_timeout configuration flag), *the
> master marks the agent as failed and follows the same steps described
> above*. However, there is one difference: by default, agents are *allowed
> to reconnect* following master failover, even after the
> agent_reregister_timeout has fired. This means that frameworks might see
> a TASK_LOST update for a task but then later discover that the task is
> running (because the agent where it was running was allowed to reconnect).

Clearly the idea was that frameworks would see TASK_LOST every time the agent is marked as lost. This behavior appears to have been broken by this commit:
https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c

Reconciliation is still required because message delivery is best-effort, but the fundamental difference is that frameworks now *rely* on reconciliation for basic operation.
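In the meantime, a framework can compensate by triggering explicit reconciliation itself whenever slaveLost() fires. Below is a minimal sketch against the old V0 Java scheduler API (org.apache.mesos); the helper name and the caller-supplied list of task IDs are hypothetical, not Aurora's actual code:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.mesos.Protos.SlaveID;
import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

// Hypothetical helper: called from Scheduler.slaveLost() with the task IDs
// the framework last saw running on that agent (tracked by the framework
// itself from offers and prior status updates).
public final class LostAgentReconciler {

  private LostAgentReconciler() {}

  public static void reconcileTasksOnLostAgent(
      SchedulerDriver driver, SlaveID lostAgent, Collection<TaskID> knownTasks) {

    List<TaskStatus> statuses = new ArrayList<>();
    for (TaskID taskId : knownTasks) {
      // For explicit reconciliation only task_id (and optionally slave_id)
      // matter; state is a required proto field, so fill in the last known
      // state (TASK_RUNNING here).
      statuses.add(TaskStatus.newBuilder()
          .setTaskId(taskId)
          .setSlaveId(lostAgent)
          .setState(TaskState.TASK_RUNNING)
          .build());
    }

    // The master replies asynchronously via Scheduler.statusUpdate() with its
    // authoritative view; for a non-partition-aware framework, tasks on agents
    // the master has removed come back as TASK_LOST. Requests for agents the
    // master is still waiting on after a failover may get no reply, so
    // reconciliation is typically retried on a timer.
    driver.reconcileTasks(statuses);
  }
}

Of course this only helps for agents the framework hears about via slaveLost; everything else still needs periodic implicit reconciliation.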
We have plans to eventually adopt partition-awareness into Aurora, but IMO this change in behavior was an oversight when trying to maintain backwards compatibility, and it can be (harmlessly) fixed in Mesos.

Cheers,
David

On 2017-07-17 09:20 (-0700), Ilya Pronin <i...@twopensource.com> wrote:
> Hi,
>
> AFAIK the absence of TASK_LOST statuses is expected. The master registry
> persists information only about agents. Tasks are recovered from
> re-registering agents. Because of that, the failed-over master can't send
> TASK_LOST for tasks that were running on an agent that didn't re-register;
> it simply doesn't know about them. The only thing the master can do in
> this situation is send a LostSlaveMessage that tells the scheduler that
> tasks on this agent are LOST/UNREACHABLE.
>
> The situation where the agent came back after the reregistration timeout
> doesn't sound good. The only way for the framework to learn about tasks
> that are still running on such an agent is either from status updates or
> via implicit reconciliation. Perhaps the master could send updates for
> tasks it learned about when such an agent is readmitted?
>
> On Sun, Jul 16, 2017 at 5:54 AM, Meghdoot bhattacharya
> <meghdoo...@yahoo.com.invalid> wrote:
>
> > This looks like a serious bug unless we are missing something.
> > Hoping for clarifications.
> >
> > Thx
> >
> > On Jul 14, 2017, at 3:52 PM, Renan DelValle <rd...@binghamton.edu> wrote:
> >
> > > Hi all,
> > >
> > > We're using Mesos 1.1.0 and have observed some unexpected behavior
> > > with regards to agent reregistration on our cluster.
> > >
> > > When a health check failure happens, our framework (in this case
> > > Apache Aurora) receives an agent-lost message along with TASK_LOST
> > > messages for each of the tasks that was currently running on the
> > > agent that failed the health check (not responding after
> > > *max_agent_ping_timeouts*).
> > >
> > > We expected the same behavior to take place when an agent does not
> > > reregister before the *agent_reregister_timeout* is up. However,
> > > while our framework did receive an agent-lost message after 10
> > > minutes had passed (the default agent_reregister_timeout value)
> > > since leader election, it did not receive any messages concerning
> > > the tasks that were running on that node.
> > >
> > > This can create a scenario where, if the agent goes away permanently,
> > > we have tasks that are unaccounted for and won't be restarted on
> > > another agent until an explicit reconciliation is done.
> > >
> > > On the other hand, if the agent does come back after the reregister
> > > timeout, and the framework has replaced the missing instances, the
> > > instances that were previously running will continue to run until an
> > > implicit reconciliation is done.
> > >
> > > I understand some behavior may have changed with partition-aware
> > > frameworks, so I'm trying to understand if this is the expected
> > > behavior.
> > >
> > > For what it's worth, Aurora is not a partition-aware framework.
> > >
> > > Any help would be appreciated,
> > >
> > > Thanks!
> > > -Renan
>
> --
> Ilya Pronin
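For completeness, the implicit reconciliation Renan and Ilya refer to is just reconcileTasks() with an empty collection; the master then sends back the latest state of every task it knows about for the framework. A minimal sketch, again against the V0 Java API (the helper class is hypothetical):

import java.util.Collections;

import org.apache.mesos.Protos.Status;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

// Hypothetical helper: implicit reconciliation asks the master for the
// current state of every task it knows about for this framework. Answers
// arrive asynchronously through Scheduler.statusUpdate(), so instances
// still running on a readmitted agent eventually show up as TASK_RUNNING.
public final class ImplicitReconciler {

  private ImplicitReconciler() {}

  public static Status requestAll(SchedulerDriver driver) {
    // An empty collection means "all tasks"; frameworks typically call this
    // periodically because the replies are best-effort.
    return driver.reconcileTasks(Collections.<TaskStatus>emptyList());
  }
}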