----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/61473/#review186615 -----------------------------------------------------------
src/master/http.cpp Lines 324 (patched) <https://reviews.apache.org/r/61473/#comment263308> Leave a blank line between these if blocks. Here and below. src/master/http.cpp Lines 350 (patched) <https://reviews.apache.org/r/61473/#comment263309> Leave a blank line between these if blocks. src/master/http.cpp Line 4203 (original), 4220-4226 (patched) <https://reviews.apache.org/r/61473/#comment263501> NPA tasks already have the state `TASK_LOST` right? Could you check that? The difference is that previously, if a framework was PA (and its unreachable tasks were stored in `unreachableTasks` as `UNREACHABLE`) and later it changed to NPA again, the code here made no attempt to double check to change it to `TASK_LOST`. The new code would do it though. It's debatable which is more right but I'd say it's safe to maintain the existing behavior? src/master/master.hpp Lines 2475-2477 (patched) <https://reviews.apache.org/r/61473/#comment263317> Because this method takes a pointer, this mutation could affect future uses of it. Even though right now nothing that cares about the state follows the call of `addUnreachableTask`, it may still be good to not propagate the change. How about doing it in `addUnreachableTask`? ``` void addUnreachableTask(const Task& _task) { Task* task = new Task(_task); // We have to use TASK_LOST for non-partition-aware frameworks // for backwards compatibility. if (!capabilities.partitionAware) { task->set_state(TASK_LOST); } unreachableTasks.set(task.task_id(), process::Owned<Task>(task)); } ``` src/master/master.hpp Line 2844 (original), 2840 (patched) <https://reviews.apache.org/r/61473/#comment263318> s/non-/Non-/ to be consistent with other instances of `NOTE`s. src/master/master.cpp Line 6448 (original), 6439 (patched) <https://reviews.apache.org/r/61473/#comment263319> The decision is pretty simple right now: all tasks are added unless the framework is completed. So this separate introduction paragraph seems redudant and can be removed now: ``` // Decide how to handle the tasks running on the agent: // ``` src/master/master.cpp Line 6450 (original), 6441 (patched) <https://reviews.apache.org/r/61473/#comment263320> "All tasks" except for those belonging to completed frameworks, so to summarize the block of code below, we should probably mention this. src/master/master.cpp Lines 6453-6456 (original), 6443-6446 (patched) <https://reviews.apache.org/r/61473/#comment263321> "no further cleanup is required" is true for all circumstances now so I think this comment is not important anymore. If we comment about the code below ignoring the history of how we got here but just as it is, we could probably just remove the whole sentence: ``` // If the master has failed // over since the agent was marked unreachable then the master shouldn't // have any record of the tasks running on the agent, so no further // cleanup is required. ``` src/master/master.cpp Lines 7188-7190 (original), 7144-7146 (patched) <https://reviews.apache.org/r/61473/#comment263411> Our handling of `TASK_UNREACHABLE` vs. `TASK_LOST` here is a little different than elsewhere so I think this warrants a bit of explanation. e.g., ``` // Transition tasks to TASK_UNREACHABLE and remove (archive) them. // We convert the task state to TASK_LOST if is the framework is not partition aware. // However we only do the conversion right before the status update is sent out or the // task is archived because the processing prior to then requires tasks to be of the // correct state TASK_UNREACHABLE. ``` Does this sound right? src/master/master.cpp Lines 7169-7173 (patched) <https://reviews.apache.org/r/61473/#comment263406> You have created this `newUpdate` but are not immediately using it, and I have to pay attention to notice that it is `update` instead of `newUpdate` that is passed to `updateTask()`. So perhaps defer this change to after `updateTask` and `removeTask` are called? At that point because the only remaining use of the `update` is to send it out, you don't even need to make a copy any more. Just change the field. src/master/master.cpp Lines 7171-7172 (patched) <https://reviews.apache.org/r/61473/#comment263399> This could be shortened to one line. ``` update.mutable_status()->set_state(TASK_LOST); ``` src/master/master.cpp Line 7218 (original), 7174 (patched) <https://reviews.apache.org/r/61473/#comment263402> Looking into the method `updateTask`, passing in `update` here will cause the operator API subscribers to receive `TASK_UNREACHABLE` for NPA task. So we should probably handle it inside. src/master/master.cpp Lines 8989-8990 (original), 8945-8946 (patched) <https://reviews.apache.org/r/61473/#comment263403> This is going to send `TASK_UNREACHABLE` to the operator API subscribers even for NPA framework tasks. We should probably be consistent and send `TASK_LOST`. - Jiang Yan Xu On Sept. 24, 2017, 3:46 p.m., Megha Sharma wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/61473/ > ----------------------------------------------------------- > > (Updated Sept. 24, 2017, 3:46 p.m.) > > > Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu. > > > Bugs: MESOS-7215 > https://issues.apache.org/jira/browse/MESOS-7215 > > > Repository: mesos > > > Description > ------- > > Master will not kill the tasks for non-Partition aware frameworks > when an unreachable agent re-registers with the master. > Master used to send a ShutdownFrameworkMessages to the agent > to kill the tasks from non partition aware frameworks including the > ones that are still registered which was problematic because the offer > from this agent could still go to the same framework which could then > launch new tasks. The agent would then receive tasks of the same > framework and ignore them because it thinks the framework is shutting > down. The framework is not shutting down of course, so from the master > and the scheduler’s perspective the task is pending in STAGING forever > until the next agent reregistration, which could happen much later. > This commit fixes the problem by not shutting down the non-partition > aware frameworks on such an agent. > > > Diffs > ----- > > src/master/http.cpp 28d0393fb5962df4d731521265efd81a54e1e655 > src/master/master.hpp 05f88111afb4fa0e2baf57106e1479914c16a113 > src/master/master.cpp 6d84a26bff970b842b58dfb69dbf232ba5c16a20 > src/tests/partition_tests.cpp 0886f4890ac3fec6f38146946892769a99c3e68f > > > Diff: https://reviews.apache.org/r/61473/diff/7/ > > > Testing > ------- > > make check > > > Thanks, > > Megha Sharma > >
