----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/47082/#review135384 -----------------------------------------------------------
See the comment I had on the tests: I think we should revisit the semantics for this message (when it should be sent). A few more points: 1. Is there a test that covers "the basic flow": `slaveLost()` is being called because it does have reserevd resources or tasks on the agent (in the same test we can have a idle framework which doesn't get notified)? 2. Since this changes Mesos behavior observable by the frameworks, we should update documentation for scheduler API `slaveLost()` to reflect this. 3. For the reason in 2), we should send an email to the dev and user list about this change before we commit it. src/master/master.hpp (lines 321 - 323) <https://reviews.apache.org/r/47082/#comment200397> 1. Would it make sense to use `multihashmap` to avoid the manual cleanups of empty `hashset` etc.? 2. Just so it's clearer to the readers let's comment on why we need to track them here. Perhaps ``` // Tasks that have not yet been launched because they are currently // being authorized. Similar to 'pendingTasks' in Framwork but we // track them separately here to determine which frameworks are // affected when an agent is removed. ``` src/master/master.cpp (line 3455) <https://reviews.apache.org/r/47082/#comment201098> The **what** in this comment feels unnecessary as the code is self-explanatory. If we add some comment here perhaps comment on **why**. Maybe: ``` // Add to the slave's list of pending tasks. This is // necessary to track frameworks who own any pending // task affected by a lost agent. ``` src/master/master.cpp (lines 3456 - 3459) <https://reviews.apache.org/r/47082/#comment201094> If we are using `multihashmap` this could be: ``` if (!slave->pendingTasks.contains(framework->id(), task.task_id()) { slave->pendingTasks.put(framework->id(), task.task_id()); } ``` src/master/master.cpp (lines 3838 - 3841) <https://reviews.apache.org/r/47082/#comment200412> We can keep the original comment which can cover both. No need to comment separately since the statements here are pretty short and easy to understand. src/master/master.cpp (line 4119) <https://reviews.apache.org/r/47082/#comment200413> Let's use `slave != nullptr` for consistentcy with the rest of the file. (Note that MESOS-3243 was recently committed) src/master/master.cpp (line 6577) <https://reviews.apache.org/r/47082/#comment201142> s/if/if there are/ So the bullet points are more consistently constructed. src/master/master.cpp (line 6578) <https://reviews.apache.org/r/47082/#comment201143> ``` 1. Reserved resources on this slave that belong to the framework's role; or ``` Note that there are also statically reserved resources that are on this slave which aren't checkpointed. src/master/master.cpp (line 6579) <https://reviews.apache.org/r/47082/#comment201144> ``` 2. Tasks on this slave launched by the framework; or ``` We should avoid using `running` to avoid confusing because it may not be `RUNNING` yet on the slave. We can just call them either tasks or pending tasks. src/master/master.cpp (line 6580) <https://reviews.apache.org/r/47082/#comment201147> nit: s/on/for/ (as in destined for) because it's not added to/launched onto the slave yet. src/master/master.cpp (line 6581) <https://reviews.apache.org/r/47082/#comment201154> src/master/master.cpp (line 6590) <https://reviews.apache.org/r/47082/#comment201155> We should look through `slave->totalResources.reserved()` which includes statically reserved resources. src/master/master.cpp (lines 6591 - 6593) <https://reviews.apache.org/r/47082/#comment201139> Per the comment above, this is not needed now. src/master/master.cpp (line 6627) <https://reviews.apache.org/r/47082/#comment201157> s/there are running/it has/ src/master/master.cpp (line 6636) <https://reviews.apache.org/r/47082/#comment201159> No need to copy, you are not modifying `slave->pendingTasks` while iterating. Also align it like this ``` foreachkey (const FrameworkID& frameworkId, slave->pendingTasks) { } ``` if it exceeded 80 chars (which looks like it didn't so it should be put on the same line). src/master/master.cpp (line 6639) <https://reviews.apache.org/r/47082/#comment201161> It's naturally taken care of when `Slave` is destructed right? Plus always use an empty line between `}` and the next statement. src/master/master.cpp (line 6651) <https://reviews.apache.org/r/47082/#comment201166> s/lost slave/the lost slave/. src/master/master.cpp (lines 6666 - 6667) <https://reviews.apache.org/r/47082/#comment201165> ``` // Add the frameworks that have pending inverse offers from this // slave for notification for the lost slave. ``` i.e., - s/lost slave/the lost slave/ - Also, while 70 char limit for comments is not a hard rule but consider using it if it makes things look less jagged. src/tests/master_tests.cpp (line 1760) <https://reviews.apache.org/r/47082/#comment201248> So the following two tests actually caught something we didn't anticipate, so instead of "fixing" the tests, we should fix our code: The scheduler driver remembers the slave pids for framework messages and old entries are removed when it receives `LostSlaveMessage`s. The way we are doing it right now can cause them to be nevered removed because they don't receive the `LostSlaveMessage`s! Note that there are two cases: 1. If the master fails over, it doesn't know anything about the lost agent so it doesn't know whehter it should send `LostSlaveMessage`s but it perahps should. 2. If the master doesn't fail over and a framework's tasks have all completed before the agent goes lost, it doesn't send `LostSlaveMessage` to it but then the save pid is not erased on the driver. Let's chat about this further. src/tests/master_tests.cpp (lines 1956 - 1957) <https://reviews.apache.org/r/47082/#comment201174> - Jiang Yan Xu On May 26, 2016, 4:56 p.m., Anindya Sinha wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/47082/ > ----------------------------------------------------------- > > (Updated May 26, 2016, 4:56 p.m.) > > > Review request for mesos and Jiang Yan Xu. > > > Bugs: MESOS-5143 > https://issues.apache.org/jira/browse/MESOS-5143 > > > Repository: mesos > > > Description > ------- > > When a slave is removed, master sends a LostSlaveMessage to affected > frameworks only (instead of all registered frameworks). An affected > framework is a framework which satisfied one or more conditions of > the following: > > 1. There are running tasks on this slave belonging to the framework. > 2. There are pending tasks on this slave belonging to the framework. > 3. Reserved resources on the slave have a matching role with the > role of the framework. > 4. There are pending offers or pending inverse offers from this slave > for the framework. > > > Diffs > ----- > > src/master/master.hpp 1a875c32eddfb6d884e3d0dda7f5716ee53966c3 > src/master/master.cpp 0005a29caabcc6a3776037cf86a2b12660e6377b > src/tests/master_tests.cpp 34be015aa314a7574e9065efb7b1bb8e1570c5b7 > > Diff: https://reviews.apache.org/r/47082/diff/ > > > Testing > ------- > > All existing and modified tests passed. > > > Thanks, > > Anindya Sinha > >
