----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/61473/#review188966 -----------------------------------------------------------
src/master/http.cpp Line 317 (original), 317 (patched) <https://reviews.apache.org/r/61473/#comment266299> This comment isn't about this line but it seems my previous comment at the top of my last review was overlooked so I am raising an issue here :) For this patch we should also update the comments above PARTITION_AWARE API to reflect the change. https://github.com/apache/mesos/blob/6eefc685ccf304d0fb8ed4ff9bc314197d77f078/include/mesos/mesos.proto#L336 Also please search the docs for necessary changes. src/master/http.cpp Lines 322-324 (patched) <https://reviews.apache.org/r/61473/#comment266286> Seems like if the framework is NPA, we don't have to iterate through the list? i.e., put the check outside the `foreachvalue` loop? src/master/http.cpp Lines 345-346 (patched) <https://reviews.apache.org/r/61473/#comment266288> Given that we now store all NPA tasks in TASK_LOST (instead of being converted to TASK_LOST on the fly here), is it more straightfoward to say ``` // Unreachable tasks belonging to a non-partition-aware framework have been stored as TASK_LOST for backward compatibility so we should export them as completed tasks. ``` ? src/master/http.cpp Lines 348-350 (patched) <https://reviews.apache.org/r/61473/#comment266285> The reason we are here is that tasks in `framework_->unreachableTasks` could be in `TASK_LOST` state and we should export those as "completed_tasks". So I think the condition check isn't right? `if (framework_->capabilities.partitionAware)` is going to skip all tasks previously stored as `TASK_LOST`. Like I suggested in another place, should we be checking against `if (task.get()->state() != TASK_LOST)` directly? src/master/http.cpp Lines 4225 (patched) <https://reviews.apache.org/r/61473/#comment266293> Is is more straightfoward to do `if (task.get()->state() == TASK_UNREACHABLE)` than the inequality check? src/master/http.cpp Lines 4228-4229 (patched) <https://reviews.apache.org/r/61473/#comment266295> Ditto on the update of the comment. src/master/master.hpp Lines 2591 (patched) <https://reviews.apache.org/r/61473/#comment265908> You made a redundant copy here but I understand this line may go away anyways. :) src/master/master.hpp Line 2588 (original), 2594 (patched) <https://reviews.apache.org/r/61473/#comment266296> Sorry we have gone back and forth but for now let's use this API: ``` void removeTask(Task* task, bool unreachable) ``` I originally hoped to solely rely on `task->state()` in the workflow but it turned out to be pretty difficult. So here let's add a comment above the method explaning the reason for the boolean. e.g., ``` // Removes the task. `unreachable` indicates whether the task is removed due to being unreachable. Note that we cannot rely on the task state because it may not reflect unreachability due to being set to TASK_LOST for backwards compatibility. ``` src/master/master.cpp Lines 9683-9684 (original), 9637-9639 (patched) <https://reviews.apache.org/r/61473/#comment266297> If the framework is PA, it can still contain tasks in `TASK_LOST`. For now let's follow the implementation in `Master::_tasks_killing()`: check the state of each task. src/tests/partition_tests.cpp Lines 2044-2045 (patched) <https://reviews.apache.org/r/61473/#comment266298> A comment shouldn't split the chain. I think ``` .WillRepeatedly(Return()); // The agent may resend status updates. ``` fits in one line? - Jiang Yan Xu On Oct. 16, 2017, 1:59 a.m., Megha Sharma wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/61473/ > ----------------------------------------------------------- > > (Updated Oct. 16, 2017, 1:59 a.m.) > > > Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu. > > > Bugs: MESOS-7215 > https://issues.apache.org/jira/browse/MESOS-7215 > > > Repository: mesos > > > Description > ------- > > Master will not kill the tasks for non-Partition aware frameworks > when an unreachable agent re-registers with the master. > Master used to send a ShutdownFrameworkMessages to the agent > to kill the tasks from non partition aware frameworks including the > ones that are still registered which was problematic because the offer > from this agent could still go to the same framework which could then > launch new tasks. The agent would then receive tasks of the same > framework and ignore them because it thinks the framework is shutting > down. The framework is not shutting down of course, so from the master > and the scheduler’s perspective the task is pending in STAGING forever > until the next agent reregistration, which could happen much later. > This commit fixes the problem by not shutting down the non-partition > aware frameworks on such an agent. > > > Diffs > ----- > > src/master/http.cpp 42139bec519d36316e324ef921157c49cdf2d043 > src/master/master.hpp 0ddc98259f64b3921d08f5f4ec81543bb0826378 > src/master/master.cpp 3603878f02ae3dba82811a4a5770dd21ec790ef6 > src/tests/partition_tests.cpp 0597bd2afaa60121245e0d43b81ac223257e377a > > > Diff: https://reviews.apache.org/r/61473/diff/8/ > > > Testing > ------- > > make check > > > Thanks, > > Megha Sharma > >
