Great! It's probably worthwhile to improve our reconciliation doc to suggest that frameworks do explicit reconciliation on slave lost messages as well.
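To make the suggestion concrete, here is a rough sketch of a framework that triggers explicit reconciliation when it is told an agent is lost. It does not call the Mesos bindings. `TaskStatus`, `RecordingDriver`, `ReconcilingScheduler`, `reconcile_tasks`, and `slave_lost` are hypothetical stand-ins, loosely modelled on the scheduler callback and driver call, and a real framework would wire the same logic into its actual driver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskStatus:
    """Hypothetical minimal record of a task's last known status."""
    task_id: str
    agent_id: str
    state: str = "TASK_RUNNING"


class RecordingDriver:
    """Hypothetical stand-in for a scheduler driver; it only records requests."""

    def __init__(self):
        self.requests = []

    def reconcile_tasks(self, statuses):
        # A real driver would send the reconciliation request to the master.
        self.requests.append(list(statuses))


class ReconcilingScheduler:
    """Tracks known tasks and reconciles the ones on an agent reported lost."""

    def __init__(self):
        self.tasks = {}  # task_id -> last known TaskStatus

    def status_update(self, driver, status):
        # Remember the latest status seen for each task.
        self.tasks[status.task_id] = status

    def slave_lost(self, driver, agent_id):
        # Explicit reconciliation: ask only about tasks known on this agent.
        affected = [s for s in self.tasks.values() if s.agent_id == agent_id]
        if affected:
            driver.reconcile_tasks(affected)


driver = RecordingDriver()
scheduler = ReconcilingScheduler()
scheduler.status_update(driver, TaskStatus("task-1", "agent-A"))
scheduler.status_update(driver, TaskStatus("task-2", "agent-B"))
scheduler.slave_lost(driver, "agent-A")
print([s.task_id for s in driver.requests[0]])
```

Since the quoted discussion notes that frameworks do not get `LostSlaveMessage` in the partition-aware timeout case, a hook like this would presumably supplement whatever other reconciliation the framework already does, not replace it.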
On Tue, Jul 18, 2017 at 12:27 PM, Ilya Pronin <ipro...@twopensource.com> wrote:

> Vinod, sure, I'd like to. I'll also look into MESOS-6406 that Neil
> mentioned, when I have time. If no one does that before :)
>
> On Tue, Jul 18, 2017 at 1:14 AM, Vinod Kone <vinodk...@apache.org> wrote:
>
> > On Mon, Jul 17, 2017 at 2:55 PM, Meghdoot bhattacharya <
> > meghdoo...@yahoo.com.invalid> wrote:
> >
> > > When there is no master fail over and agents join back after the default
> > > 5*15 timeout, we do see tasks getting killed like it used to. Because in
> > > this case master has sent task lost to framework.
> > > But we are noticing shutdown() executor callback not getting invoked. We
> > > started a different thread on it. This is mesos 1.1.
> > >
> > > Are you trying to say tasks will leak in latest versions and again relies
> > > on recon for the regular health check timeout scenario and agent joining
> > > back?
> > >
> >
> > There should be no task leaks. After partition awareness code has landed,
> > the master no longer shuts down the agents in the above scenario but it
> > still shuts down the tasks/executors of the non-partition-aware frameworks.
> > So the observable behavior for a framework regarding its tasks/executors
> > should not change. The one observable change is that frameworks do not get
> > `LostSlaveMessage` (`lostSlave()` callback on the driver) in this case.
> >
>
> --
> Ilya Pronin