----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64940/#review194777 -----------------------------------------------------------
src/master/master.cpp Lines 10058-10060 (patched) <https://reviews.apache.org/r/64940/#comment273769> Perhaps we just need to comment on the logic we use to determine if the task is `unreachable` and not what `removeTask` is going to do about it (because it's hidden here)? i.e., remove the comment here and use something like what I suggested below? src/master/master.cpp Lines 10061 (patched) <https://reviews.apache.org/r/64940/#comment273760> Perhaps move all the logic around determining the unreachable task down here. ``` // We need to inform `removeTask` whether the task became unreachable due to the agent being removed. Note that terminal tasks are not considered unreachable. We can not rely on the task's state because it may have been set to `TASK_LOST` for backwards-compatibility. bool unreachable = unreachableTime.isSome() && !protobuf::isTerminalState(task->state()); ``` Inlining `unreachable` is probably fine too but separating it helps clarify it with comments? src/master/master.cpp Line 10273 (original), 10279-10280 (patched) <https://reviews.apache.org/r/64940/#comment273766> So we had to use a boolean because we cannot rely on the task *state*, so "any of the unreachable states" is a bit misleading? I think the CHECK message is already clear enough for now? (Like you did in `removeTask`) I suggest we don't comment further on the removable-ness of the task because we might want to change it: https://issues.apache.org/jira/browse/MESOS-8389 src/tests/master_tests.cpp Lines 7601-7602 (patched) <https://reviews.apache.org/r/64940/#comment273805> This reads like "terminal tasks" are stored as "either finished or lost" when it's only the non-terminal task that is transitioned to lost. Can we just say that we are verifying "the terminal task is stored in its original state when parition happens"? src/tests/master_tests.cpp Lines 7603 (patched) <https://reviews.apache.org/r/64940/#comment273804> Most agent partition related tests are put in partition_tests.cpp, can we put this one there as well for better grouping? In particular there is a test named `TaskCompletedOnPartitionedAgent` there which is somewhat related. We can name this one `AgentWithCompletedTaskPartitioned` or `TaskCompletedBeforeAgentPartitioned` The main word I want to change is `Lost`. While I can understand you are using it interchangably with *partitioned*, in this context when it's call partitioned elsewhere and there's `TaskLost` to confuse with it, perhaps it's better to use the more precise word. src/tests/master_tests.cpp Lines 7621 (patched) <https://reviews.apache.org/r/64940/#comment273811> We typically name this `offers` since it's a vector. You'll be using the first and only one so later you should ``` AWAIT_READY(offers); ASSERT_FALSE(offers->empty()); const Offer& offer = offers->at(0); ``` src/tests/master_tests.cpp Lines 7637 (patched) <https://reviews.apache.org/r/64940/#comment273812> You could just wait for the the status update (being droppped) instead of capaturing `Slave::executorTerminated` and risk the race condition of the master not receiving the status update before the agent is deemed partitioned? src/tests/master_tests.cpp Lines 7641 (patched) <https://reviews.apache.org/r/64940/#comment273810> Instead of settling it's more direct to just wait for the offer before using it (like I mentioned above). src/tests/master_tests.cpp Lines 7675 (patched) <https://reviews.apache.org/r/64940/#comment273783> "fast fast" to emphasize? :) src/tests/master_tests.cpp Lines 7681 (patched) <https://reviews.apache.org/r/64940/#comment273784> If you wait for the ping, you don't need to settle? src/tests/master_tests.cpp Lines 7686 (patched) <https://reviews.apache.org/r/64940/#comment273785> No need to settle? - Jiang Yan Xu On Jan. 3, 2018, 5:35 p.m., James Peach wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64940/ > ----------------------------------------------------------- > > (Updated Jan. 3, 2018, 5:35 p.m.) > > > Review request for mesos, Gaston Kleiman, Jie Yu, and Jiang Yan Xu. > > > Bugs: MESOS-8337 > https://issues.apache.org/jira/browse/MESOS-8337 > > > Repository: mesos > > > Description > ------- > > If an agent is lost, we try to remove all the tasks that might > have been lost. However, if a task is already terminal, it hasn't > really been lost so we should not be tracking it in the framework's > unreachable tasks list. > > > Diffs > ----- > > src/master/master.hpp 8fe9420dbe03ea2cefc6a40b0f64284aa9fe7915 > src/master/master.cpp 282fdf8ac38e3613c621c1c8e5c50f27bde9dafd > src/tests/master_tests.cpp 5546fd937d078c1f757964d5163449ffd993388e > > > Diff: https://reviews.apache.org/r/64940/diff/1/ > > > Testing > ------- > > make check (Fedora 27) > > > Thanks, > > James Peach > >