> On June 6, 2016, 5:43 a.m., Jiang Yan Xu wrote:
> > src/tests/master_tests.cpp, line 1760
> > <https://reviews.apache.org/r/47082/diff/2/?file=1395974#file1395974line1760>
> >
> > So the following two tests actually caught something we didn't
> > anticipate, so instead of "fixing" the tests, we should fix our code:
> >
> > The scheduler driver remembers the slave pids for framework messages
> > and old entries are removed when it receives `LostSlaveMessage`s. The way
> > we are doing it right now can cause them to never be removed because they
> > don't receive the `LostSlaveMessage`s!
> >
> > Note that there are two cases:
> >
> > 1. If the master fails over, it doesn't know anything about the lost
> > agent so it doesn't know whether it should send `LostSlaveMessage`s but it
> > perhaps should.
> > 2. If the master doesn't fail over and a framework's tasks have all
> > completed before the agent goes lost, it doesn't send `LostSlaveMessage` to
> > it but then the slave pid is not erased on the driver.
> >
> > Let's chat about this further.

Fixed case 2. See https://reviews.apache.org/r/48453/. UPIDs are now tracked per executor: each entry lives from the time the executor is launched until the time the executor terminates. As a result, not sending a LostSlaveMessage to frameworks that have no tasks or reservations on that agent does not interfere with cleaning up slave UPIDs.

Regarding case 1: When the master fails over, we send a LostSlaveMessage to a framework only if it has a task, reservation, or pending offer on the agent, and not otherwise. I think that should be fine. Let us chat about this case a bit more.


- Anindya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47082/#review135384
-----------------------------------------------------------


On May 26, 2016, 11:56 p.m., Anindya Sinha wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47082/
> -----------------------------------------------------------
>
> (Updated May 26, 2016, 11:56 p.m.)
>
>
> Review request for mesos and Jiang Yan Xu.
>
>
> Bugs: MESOS-5143
>     https://issues.apache.org/jira/browse/MESOS-5143
>
>
> Repository: mesos
>
>
> Description
> -------
>
> When a slave is removed, the master sends a LostSlaveMessage to affected
> frameworks only (instead of all registered frameworks). An affected
> framework is a framework that satisfies one or more of the following
> conditions:
>
> 1. There are running tasks on this slave belonging to the framework.
> 2. There are pending tasks on this slave belonging to the framework.
> 3. Reserved resources on the slave have a role matching the
>    role of the framework.
> 4. There are pending offers or pending inverse offers from this slave
>    for the framework.
>
>
> Diffs
> -----
>
>   src/master/master.hpp 1a875c32eddfb6d884e3d0dda7f5716ee53966c3
>   src/master/master.cpp 0005a29caabcc6a3776037cf86a2b12660e6377b
>   src/tests/master_tests.cpp 34be015aa314a7574e9065efb7b1bb8e1570c5b7
>
> Diff: https://reviews.apache.org/r/47082/diff/
>
>
> Testing
> -------
>
> All existing and modified tests passed.
>
>
> Thanks,
>
> Anindya Sinha
>
>
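
For reference, the four "affected framework" conditions from the description can be sketched as a small predicate. This is only a simplified illustration and not the code in master.cpp: `FrameworkInfo`, `AgentState`, `isAffected`, and `lostSlaveRecipients` are hypothetical names, and the sets below are stand-ins for whatever bookkeeping the master actually keeps.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for master state (not real Mesos types).
struct FrameworkInfo {
  std::string id;
  std::string role;
};

struct AgentState {
  std::set<std::string> frameworksWithRunningTasks;  // condition 1
  std::set<std::string> frameworksWithPendingTasks;  // condition 2
  std::set<std::string> reservedRoles;               // condition 3
  std::set<std::string> frameworksWithOffers;        // condition 4 (offers or inverse offers)
};

// Returns true if the framework meets at least one of the four conditions.
bool isAffected(const FrameworkInfo& framework, const AgentState& agent) {
  return agent.frameworksWithRunningTasks.count(framework.id) > 0 ||
         agent.frameworksWithPendingTasks.count(framework.id) > 0 ||
         agent.reservedRoles.count(framework.role) > 0 ||
         agent.frameworksWithOffers.count(framework.id) > 0;
}

// Collects the ids of frameworks that would be sent a LostSlaveMessage
// when the given agent is removed.
std::vector<std::string> lostSlaveRecipients(
    const std::vector<FrameworkInfo>& frameworks,
    const AgentState& agent) {
  std::vector<std::string> recipients;
  for (const FrameworkInfo& framework : frameworks) {
    if (isAffected(framework, agent)) {
      recipients.push_back(framework.id);
    }
  }
  return recipients;
}
```

With this sketch, a framework whose tasks have all completed and whose role has no reservations on the agent is not a recipient, which is the situation discussed as case 2 above.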