Re: Agent reregistration timeout, no TASK_LOST messages

Vinod Kone Tue, 18 Jul 2017 12:36:11 -0700

Great!

It's probably worthwhile to improve our reconciliation doc to suggest that
frameworks do explicit reconciliation on slave lost messages as well.
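
As a rough illustration of what that doc suggestion could look like on the framework side, here is a minimal sketch in Python. It is not taken from the Mesos docs or any framework. RecordingDriver is a hypothetical stand-in for a scheduler driver, the plain dicts stand in for TaskStatus messages, and the method names slaveLost and reconcileTasks are only modeled on the scheduler API. The framework remembers which tasks it placed on which agent and, when told an agent is lost, asks the master to reconcile exactly those tasks.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingDriver:
    """Hypothetical stand-in for a scheduler driver; it only records requests."""
    requests: list = field(default_factory=list)

    def reconcileTasks(self, statuses):
        # A real driver would send these statuses to the master.
        self.requests.append(list(statuses))


class ReconcilingScheduler:
    """Tracks tasks per agent and reconciles them explicitly on agent loss."""

    def __init__(self):
        self.tasks_by_agent = {}  # agent id -> set of task ids

    def task_launched(self, agent_id, task_id):
        self.tasks_by_agent.setdefault(agent_id, set()).add(task_id)

    def slaveLost(self, driver, agent_id):
        # Explicit reconciliation: name every task we believe ran on the
        # lost agent so the master replies with its view of each one.
        task_ids = sorted(self.tasks_by_agent.get(agent_id, ()))
        if not task_ids:
            return  # nothing known on that agent, nothing to reconcile
        statuses = [{"task_id": t, "slave_id": agent_id} for t in task_ids]
        driver.reconcileTasks(statuses)
```

Note that, per the quoted discussion below, a framework may not receive the slave lost message at all in the re-registration timeout case, so a sketch like this would complement periodic reconciliation rather than replace it.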


On Tue, Jul 18, 2017 at 12:27 PM, Ilya Pronin <ipro...@twopensource.com>
wrote:

> Vinod, sure, I'd like to. I'll also look into MESOS-6406, which Neil
> mentioned, when I have time, if no one gets to it first :)
>
> On Tue, Jul 18, 2017 at 1:14 AM, Vinod Kone <vinodk...@apache.org> wrote:
>
> > On Mon, Jul 17, 2017 at 2:55 PM, Meghdoot bhattacharya <
> > meghdoo...@yahoo.com.invalid> wrote:
> >
> > > When there is no master fail over and agents join back after the default
> > > 5*15 timeout, we do see tasks getting killed like it used to. Because in
> > > this case master has sent task lost to framework.
> > > But we are noticing shutdown() executor callback not getting invoked. We
> > > started a different thread on it. This is mesos 1.1.
> > >
> > > Are you trying to say tasks will leak in latest versions and again relies
> > > on recon for the regular health check timeout scenario and agent joining
> > > back?
> > >
> >
> > There should be no task leaks. After partition awareness code has landed,
> > the master no longer shuts down the agents in the above scenario but it
> > still shuts down the tasks/executors of the non-partition-aware frameworks.
> > So the observable behavior for a framework regarding its tasks/executors
> > should not change. The one observable change is that frameworks do not get
> > `LostSlaveMessage` (`lostSlave()` callback on the driver) in this case.
> >
>
> --
> Ilya Pronin
>
