----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/58898/#review174194 -----------------------------------------------------------
My apologies for the delay in reviewing this. High-level comments: (a) Can we improve the description of the problem in the commit summary? It took me quite a while to figure out what is actually going on here. My understanding is: (1) Agent re-registers (2) Master sends `ShutdownFrameworkMessage` for non-PA frameworks on the agent (3) Master offers agent resources to framework (4) Framework launches new task on agent _before the agent has finished shutting down the framework_ (5) Agent ignores launch task because it believes the framework is still terminating. This is a race condition, because if the agent finished shutting down the framework (in #4) before the launch task was received, there would not be a problem. Is my understanding correct? (2) I'd prefer a new unit test that constructs exactly this scenario, rather than changing existing unit tests. (3) The new behavior is that the framework will receive `TASK_KILLED` for any non-PA tasks running on a partitioned agent that re-registers. This does not seem ideal, because `TASK_KILLED` _normally_ corresponds to a framework-initiated `killTask` operation. (4) Thinking out loud, what if a custom executor ignores the `killTask` request? When shutting down a framework, it seems we forcibly destroy the container (via `containerizer->destroy()`), if the executor doesn't exit promptly upon receiving the framework shutdown request. AFAIK we don't have similar logic for `killTask` requests. 3 + 4 suggests that maybe we want a different way to terminate the task on the agent -- let's discuss. src/master/master.cpp Lines 6034 (patched) <https://reviews.apache.org/r/58898/#comment247713> `tasksToKill` src/master/master.cpp Lines 6057 (patched) <https://reviews.apache.org/r/58898/#comment247313> "Keep" src/master/master.cpp Lines 6058 (patched) <https://reviews.apache.org/r/58898/#comment247312> "separate data structure" src/master/master.cpp Line 6097 (original), 6102 (patched) <https://reviews.apache.org/r/58898/#comment247314> "launched by" src/tests/partition_tests.cpp Lines 693 (patched) <https://reviews.apache.org/r/58898/#comment247742> Why did you move this to before the agent re-registers? The point of doing explicit reconciliation is to check the master's state after the agent re-registers. There's no harm in _also_ reconciling before the agent re-registers, but if you want to add that, let's do it in a separate review. src/tests/partition_tests.cpp Line 754 (original), 786 (patched) <https://reviews.apache.org/r/58898/#comment247330> `EXPECT_FALSE`. src/tests/partition_tests.cpp Line 794 (original), 826 (patched) <https://reviews.apache.org/r/58898/#comment247746> The point of this part of the test is to check that the agent's metrics are correct after re-registration (and before starting any additional tasks). I'd prefer to check the metrics, then start another task on the agent afterward. src/tests/partition_tests.cpp Line 2049 (original) <https://reviews.apache.org/r/58898/#comment247745> Removing this results in two gmock warnings: ``` GMOCK WARNING: Uninteresting mock function call - returning directly. Function call: killTask(0x7fbcf1c5fa20, @0x7fbcf1c54640 1) ``` ``` GMOCK WARNING: Uninteresting mock function call - returning directly. Function call: shutdown(0x7fbcf1c5fa20) ``` - Neil Conway On May 1, 2017, 11:58 p.m., Megha Sharma wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/58898/ > ----------------------------------------------------------- > > (Updated May 1, 2017, 11:58 p.m.) > > > Review request for mesos, Neil Conway and Jiang Yan Xu. > > > Bugs: MESOS-7215 > https://issues.apache.org/jira/browse/MESOS-7215 > > > Repository: mesos > > > Description > ------- > > Mesos is now sending ShutdownFrameworkMessages to the agent for all > non-partition-aware frameworks (including the ones that are still > registered). This is problematic. The offer from this agent can > still go to the same framework which can then launch new tasks. > The agent then receives tasks of the same framework and ignores > them because it thinks the framework is shutting down. The framework > is not shutting down of course, so from the master and the scheduler's > perspective the task is pending in STAGING forever until the next agent > reregistration, which could happen much later. This commit fixes > the problem by sending a task kill instead of ShutdownFrameworkMessage, > when the agent re-registers after being unreachable, for non-partition > aware framewworks. > > > Diffs > ----- > > src/master/master.cpp 31a7a2fcf905c0c35e80692a69c290d4094deded > src/tests/partition_tests.cpp 4ff428564d1fa6cb96e6f8ec8edc331da88a3eb6 > > > Diff: https://reviews.apache.org/r/58898/diff/2/ > > > Testing > ------- > > make check > > > Thanks, > > Megha Sharma > >
