> On Sept. 29, 2017, 6:19 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 7188-7190 (original), 7144-7146 (patched)
> > <https://reviews.apache.org/r/61473/diff/7/?file=1819742#file1819742line7188>
> >
> > Our handling of `TASK_UNREACHABLE` vs. `TASK_LOST` here is a little
> > different than elsewhere so I think this warrants a bit of explanation.
> >
> > e.g.,
> > ```
> > // Transition tasks to TASK_UNREACHABLE and remove (archive) them.
> > // We convert the task state to TASK_LOST if the framework is not
> > // partition aware. However, we only do the conversion right before
> > // the status update is sent out or the task is archived, because the
> > // processing prior to then requires tasks to be in the correct state,
> > // TASK_UNREACHABLE.
> > ```
> >
> > Does this sound right?

+1


> On Sept. 29, 2017, 6:19 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 8989-8990 (original), 8945-8946 (patched)
> > <https://reviews.apache.org/r/61473/diff/7/?file=1819742#file1819742line8994>
> >
> > This is going to send `TASK_UNREACHABLE` to the operator API
> > subscribers even for NPA framework tasks.
> >
> > We should probably be consistent and send `TASK_LOST`.

Right, missed it. So, one way to solve it is to let the state be TASK_LOST for NPA frameworks and change it to TASK_UNREACHABLE just before calling removeTask(), so the task goes into the unreachable tasks data structure.


- Megha


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review186615
-----------------------------------------------------------


On Oct. 16, 2017, 8:59 a.m., Megha Sharma wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> -----------------------------------------------------------
>
> (Updated Oct. 16, 2017, 8:59 a.m.)
>
>
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
>
>
> Bugs: MESOS-7215
>     https://issues.apache.org/jira/browse/MESOS-7215
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Master will not kill the tasks of non-partition-aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessage to the agent
> to kill the tasks of non-partition-aware frameworks, including the
> ones that are still registered. This was problematic because offers
> from this agent could still go to the same framework, which could then
> launch new tasks. The agent would then receive tasks of the same
> framework and ignore them because it thinks the framework is shutting
> down. The framework is not shutting down of course, so from the master
> and the scheduler's perspective the task is pending in STAGING forever
> until the next agent reregistration, which could happen much later.
> This commit fixes the problem by not shutting down the
> non-partition-aware frameworks on such an agent.
>
>
> Diffs
> -----
>
>   src/master/http.cpp 42139bec519d36316e324ef921157c49cdf2d043
>   src/master/master.hpp 0ddc98259f64b3921d08f5f4ec81543bb0826378
>   src/master/master.cpp 3603878f02ae3dba82811a4a5770dd21ec790ef6
>   src/tests/partition_tests.cpp 0597bd2afaa60121245e0d43b81ac223257e377a
>
>
> Diff: https://reviews.apache.org/r/61473/diff/8/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Megha Sharma
>
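
For illustration only, the state conversion discussed in this thread could be sketched roughly as follows. This is not the code from the patch: `TaskState`, `Framework`, and `reportedState` are hypothetical, simplified stand-ins rather than the Mesos types, and the sketch assumes the task is kept as TASK_UNREACHABLE internally and downgraded to TASK_LOST only in what is reported for a non-partition-aware framework.

```cpp
// Hypothetical, simplified stand-ins for the Mesos types (assumptions,
// not the real master API).
enum class TaskState { TASK_RUNNING, TASK_UNREACHABLE, TASK_LOST };

struct Framework {
  bool partitionAware;  // Stands in for the PARTITION_AWARE capability.
};

// Returns the state to report in a status update or to operator API
// subscribers. The internal state is left untouched; only the outgoing
// view becomes TASK_LOST when the framework is not partition aware.
constexpr TaskState reportedState(
    TaskState internal,
    const Framework& framework)
{
  return (internal == TaskState::TASK_UNREACHABLE &&
          !framework.partitionAware)
    ? TaskState::TASK_LOST
    : internal;
}
```

Keeping the conversion in one helper like this would mean the scheduler path and the operator API path cannot disagree about which state a non-partition-aware framework's task is reported in.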
