Re: Agent reregistration timeout, no TASK_LOST messages

Ilya Pronin Tue, 18 Jul 2017 12:28:41 -0700

Vinod, sure, I'd like to. I'll also look into MESOS-6406, which Neil
mentioned, when I have time, if no one gets to it first :)


On Tue, Jul 18, 2017 at 1:14 AM, Vinod Kone <[email protected]> wrote:

> On Mon, Jul 17, 2017 at 2:55 PM, Meghdoot bhattacharya <
> [email protected]> wrote:
>
> > When there is no master failover and agents rejoin after the default
> > 5*15-second timeout, we do see tasks getting killed as before, because
> > in this case the master has sent TASK_LOST to the framework.
> > But we are noticing that the shutdown() executor callback is not being
> > invoked. We started a different thread on it. This is Mesos 1.1.
> >
> > Are you saying that tasks will leak in the latest versions, and that we
> > again have to rely on reconciliation for the regular health check
> > timeout scenario where the agent rejoins?
> >
>
> There should be no task leaks. Since the partition-awareness code landed,
> the master no longer shuts down the agents in the above scenario, but it
> still shuts down the tasks/executors of non-partition-aware frameworks.
> So the observable behavior for a framework regarding its tasks/executors
> should not change. The one observable change is that frameworks do not get
> `LostSlaveMessage` (`lostSlave()` callback on the driver) in this case.
>
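
For reference, the "5*15" figure quoted above appears to be the master's
default agent health-check window. A minimal sketch of that arithmetic,
assuming the defaults are a 15-second ping timeout (`--agent_ping_timeout`)
and 5 allowed missed pings (`--max_agent_ping_timeouts`); the function name
is hypothetical and is not part of Mesos:

```python
# Sketch only: the default values are assumptions taken from the "5*15"
# figure in the thread; check the master flags on your own cluster.
def agent_unreachable_after_secs(ping_timeout_secs=15, max_ping_timeouts=5):
    """Seconds of missed health checks before the master gives up on an agent."""
    return ping_timeout_secs * max_ping_timeouts


if __name__ == "__main__":
    print(agent_unreachable_after_secs())  # 75
```

So with these assumed defaults, an agent that stays silent for about 75
seconds falls into the scenario discussed in this thread.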

-- 
Ilya Pronin
