On Mon, Jul 17, 2017 at 9:20 AM, Ilya Pronin <ipro...@twopensource.com> wrote:
> AFAIK the absence of TASK_LOST statuses is expected. Master registry
> persists information only about agents. Tasks are recovered from
> re-registering agents. Because of that the failed over master can't send
> TASK_LOST for tasks that were running on the agent that didn't re-register,
> it simply doesn't know about them. The only thing the master can do in this
> situation is send LostSlaveMessage that will tell the scheduler that tasks
> on this agent are LOST/UNREACHABLE.
>
> +1. The situation where the agent came back after reregistration timeout
> doesn't sound good. The only way for the framework to learn about tasks
> that are still running on such agent is either from status updates or via
> implicit reconciliation. Perhaps, the master could send updates for tasks
> it learned about when such agent is readmitted?
>

I agree this would be a good idea:

https://issues.apache.org/jira/browse/MESOS-6406

I haven't had a chance to implement it yet, but if someone is interested, I
think this would be a pretty nicely scoped project.

Neil
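
For anyone unfamiliar with the implicit reconciliation mentioned in the quoted
text, here is a minimal sketch of what the request body might look like when
using the v1 scheduler HTTP API: a RECONCILE call with an empty task list asks
the master for the latest state of every task it currently knows about for the
framework. The function name and the framework ID below are hypothetical
placeholders, and the sketch only builds the JSON body without sending it.

```python
import json


def build_implicit_reconcile_call(framework_id: str) -> dict:
    """Build the body of a RECONCILE call that lists no tasks.

    Leaving the task list empty requests implicit reconciliation, i.e. the
    master reports on all tasks it knows about for this framework.
    `build_implicit_reconcile_call` is an illustrative helper, not part of
    any Mesos library.
    """
    return {
        "framework_id": {"value": framework_id},
        "type": "RECONCILE",
        "reconcile": {"tasks": []},
    }


if __name__ == "__main__":
    # "example-framework-id" is a placeholder, not a real framework ID.
    print(json.dumps(build_implicit_reconcile_call("example-framework-id"), indent=2))
```

A scheduler would typically send this body over its existing subscription to
the master. Note that the master can only answer for tasks it has learned
about, so tasks on an agent that has not re-registered would not show up in
the response until that agent is readmitted.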