Re: Agent reregistration timeout, no TASK_LOST messages

Neil Conway Mon, 17 Jul 2017 09:35:34 -0700

On Mon, Jul 17, 2017 at 9:20 AM, Ilya Pronin <[email protected]> wrote:


> AFAIK the absence of TASK_LOST statuses is expected. Master registry
> persists information only about agents. Tasks are recovered from
> re-registering agents. Because of that the failed over master can't send
> TASK_LOST for tasks that were running on the agent that didn't re-register,
> it simply doesn't know about them. The only thing the master can do in this
> situation is send LostSlaveMessage that will tell the scheduler that tasks
> on this agent are LOST/UNREACHABLE.
>

+1.

> The situation where the agent came back after reregistration timeout
> doesn't sound good. The only way for the framework to learn about tasks
> that are still running on such agent is either from status updates or via
> implicit reconciliation. Perhaps, the master could send updates for tasks
> it learned about when such agent is readmitted?
>

I agree this would be a good idea:
https://issues.apache.org/jira/browse/MESOS-6406

I haven't had a chance to implement it yet, but if someone is interested, I
think this would be a pretty nicely scoped project.
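
To make the scenario concrete, here is a rough toy model in Python. It is
not Mesos code: the class names, the send_updates_on_readmit flag, and the
choice of TASK_RUNNING as the reported state are all assumptions made up for
illustration. It only mirrors the behaviour described above: after failover
the master knows agent IDs from the registry, learns tasks only from agents
that re-register, and can send nothing but a lost-agent message for an agent
that misses the timeout. The flag sketches what sending updates on
readmission might look like.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    agent_id: str
    tasks: list  # task IDs; after a master failover only the agent knows them


@dataclass
class FailedOverMaster:
    # Recovered from the registry: agent IDs only, no task information.
    registry_agents: set
    # Hypothetical switch modelling the proposed updates-on-readmission idea.
    send_updates_on_readmit: bool = False
    known_tasks: dict = field(default_factory=dict)  # agent_id -> task IDs
    unreachable: set = field(default_factory=set)
    outbox: list = field(default_factory=list)  # messages "sent" to the scheduler

    def reregister(self, agent):
        readmitted = agent.agent_id in self.unreachable
        self.unreachable.discard(agent.agent_id)
        # The master learns about tasks from the agent itself.
        self.known_tasks[agent.agent_id] = list(agent.tasks)
        if readmitted and self.send_updates_on_readmit:
            # Assumption: the tasks are still running on the returning agent.
            for task_id in agent.tasks:
                self.outbox.append(("TASK_RUNNING", task_id))

    def reregistration_timeout(self):
        missing = self.registry_agents - set(self.known_tasks) - self.unreachable
        for agent_id in sorted(missing):
            self.unreachable.add(agent_id)
            # No per-task update is possible: the master has no task list.
            self.outbox.append(("LostSlaveMessage", agent_id))


def simulate(send_updates_on_readmit):
    on_time = Agent("agent-1", ["task-a"])
    late = Agent("agent-2", ["task-b", "task-c"])
    master = FailedOverMaster({"agent-1", "agent-2"}, send_updates_on_readmit)
    master.reregister(on_time)       # re-registers before the timeout
    master.reregistration_timeout()  # agent-2 has not shown up yet
    master.reregister(late)          # agent-2 comes back afterwards
    return master.outbox
```

With the flag off, simulate(False) yields only the lost-agent message for
agent-2, so the scheduler hears nothing further about task-b and task-c
unless it reconciles. With the flag on, the same run also yields one update
per task once agent-2 is readmitted.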

Neil
