> On June 6, 2016, 5:43 a.m., Jiang Yan Xu wrote:
> > src/tests/master_tests.cpp, line 1760
> > <https://reviews.apache.org/r/47082/diff/2/?file=1395974#file1395974line1760>
> >
> >     So the following two tests actually caught something we didn't 
> > anticipate, so instead of "fixing" the tests, we should fix our code:
> >     
> >     The scheduler driver remembers the slave pids for framework messages 
> > and old entries are removed when it receives `LostSlaveMessage`s. The way 
> > we are doing it right now can cause them to never be removed because the 
> > driver doesn't receive the `LostSlaveMessage`s!
> >     
> >     Note that there are two cases:
> >     
> >     1. If the master fails over, it doesn't know anything about the lost 
> > agent, so it doesn't know whether it should send `LostSlaveMessage`s, but it 
> > perhaps should.
> >     2. If the master doesn't fail over and a framework's tasks have all 
> > completed before the agent goes lost, the master doesn't send a 
> > `LostSlaveMessage` to that framework, so the slave pid is never erased on 
> > the driver.
> >     
> >     Let's chat about this further.

Fixed case 2. See https://reviews.apache.org/r/48453/.
UPIDs are now tracked per executor: each entry lives from the time the 
executor is launched until the time the executor terminates. As a result, not 
sending a LostSlaveMessage to frameworks that have no tasks or reservations on 
that agent does not interfere with the cleanup of slave UPIDs.
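
To illustrate the idea of tying a pid entry's lifetime to its executor, here is 
a minimal sketch. It is not the code from r/48453: the class name, the method 
names, and the use of plain strings for IDs and pids are all assumptions made 
for illustration (the real driver works with SlaveID, ExecutorID, and 
process::UPID).

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins for SlaveID, ExecutorID and process::UPID.
using AgentId = std::string;
using ExecutorId = std::string;
using Pid = std::string;

// Hypothetical tracker: an entry exists exactly while its executor is alive,
// so cleanup never depends on receiving a LostSlaveMessage.
class ExecutorPidTracker
{
public:
  // Record the pid when the executor is launched.
  void executorLaunched(
      const AgentId& agent, const ExecutorId& executor, const Pid& pid)
  {
    pids[std::make_pair(agent, executor)] = pid;
  }

  // Drop the entry when the executor terminates.
  void executorTerminated(const AgentId& agent, const ExecutorId& executor)
  {
    pids.erase(std::make_pair(agent, executor));
  }

  bool contains(const AgentId& agent, const ExecutorId& executor) const
  {
    return pids.count(std::make_pair(agent, executor)) > 0;
  }

  std::size_t size() const { return pids.size(); }

private:
  std::map<std::pair<AgentId, ExecutorId>, Pid> pids;
};
```

With this shape, an agent whose executors have all terminated leaves no entries 
behind, whether or not the framework is later told that the agent was lost.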

Regarding case 1: When the master fails over, we send a LostSlaveMessage to a 
framework if the agent has a task, reservation, or pending offer for that 
framework, and not otherwise. I think that should be fine. Let us chat about 
this case a bit more.
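
For reference while we discuss, the "affected framework" conditions from the 
review description (running tasks, pending tasks, a reservation with a matching 
role, pending offers or inverse offers) can be sketched as one predicate. This 
is only an illustration: `AgentView`, `FrameworkView`, and `isAffected` are 
hypothetical names and do not correspond to the structures in 
src/master/master.cpp.

```cpp
#include <set>
#include <string>

// Hypothetical, simplified view of what the master knows about one agent.
struct AgentView
{
  std::set<std::string> frameworksWithRunningTasks;
  std::set<std::string> frameworksWithPendingTasks;
  std::set<std::string> reservedRoles;        // Roles with reservations here.
  std::set<std::string> frameworksWithOffers; // Offers or inverse offers.
};

// Hypothetical, simplified view of a registered framework.
struct FrameworkView
{
  std::string id;
  std::string role;
};

// A framework is "affected" by the agent's removal if any one of the four
// conditions from the review description holds.
bool isAffected(const AgentView& agent, const FrameworkView& framework)
{
  return agent.frameworksWithRunningTasks.count(framework.id) > 0 ||
         agent.frameworksWithPendingTasks.count(framework.id) > 0 ||
         agent.reservedRoles.count(framework.role) > 0 ||
         agent.frameworksWithOffers.count(framework.id) > 0;
}
```

Under this sketch the master would send a LostSlaveMessage only to frameworks 
for which `isAffected` returns true.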


- Anindya


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/47082/#review135384
-----------------------------------------------------------


On May 26, 2016, 11:56 p.m., Anindya Sinha wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/47082/
> -----------------------------------------------------------
> 
> (Updated May 26, 2016, 11:56 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-5143
>     https://issues.apache.org/jira/browse/MESOS-5143
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When a slave is removed, the master sends a LostSlaveMessage only to
> affected frameworks (instead of all registered frameworks). An affected
> framework is one that satisfies one or more of the following
> conditions:
> 
> 1. There are running tasks on this slave belonging to the framework.
> 2. There are pending tasks on this slave belonging to the framework.
> 3. Reserved resources on the slave have a role matching the
>    framework's role.
> 4. There are pending offers or pending inverse offers from this slave
>    for the framework.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 1a875c32eddfb6d884e3d0dda7f5716ee53966c3 
>   src/master/master.cpp 0005a29caabcc6a3776037cf86a2b12660e6377b 
>   src/tests/master_tests.cpp 34be015aa314a7574e9065efb7b1bb8e1570c5b7 
> 
> Diff: https://reviews.apache.org/r/47082/diff/
> 
> 
> Testing
> -------
> 
> All existing and modified tests passed.
> 
> 
> Thanks,
> 
> Anindya Sinha
> 
>
