> On Oct. 28, 2019, 6:07 p.m., Benjamin Mahler wrote:
> > src/master/master.cpp
> > Lines 7848 (patched)
> > <https://reviews.apache.org/r/71641/diff/2/?file=2170613#file2170613line7848>
> >
> > Hm.. don't we enforce agent removal by not allowing the agent to
> > re-register?
> >
> > In the framework removal case, I guess we're not enforcing it?
> >
> > Having the task transition out of terminal seems a bit strange for
> > those two cases (are there other cases?)
>
> Benjamin Bannier wrote:
> One scenario where this can happen is maintenance where an agent goes
> `down` and then `up` again after agent failover. The master will transition
> the tasks without waiting for task status updates from the agent. This patch
> adds a test for that (which fails without the patch).
>
> I could imagine scenarios involving framework teardown, agent failover,
> and framework registration using the old `FrameworkID` as well when the
> master has already forgotten the ID.
>
> This patch merely introduces a patch for possible inconsistencies due to
> the design; we should fix the design as well, see e.g., MESOS-9940 which
> addresses one framework teardown edge case.
>
> Benjamin Mahler wrote:
> Ok, perhaps the patch and comment can be re-framed? "Garbage-collect"
> sounds like cleaning up old unneeded data, but this is a mitigation papering
> over possible inconsistency that can arise due to flawed design (i.e. lack of
> enforcement of actions that the master is taking, or in the case of
> MESOS-9940 probably the master should defer to the agent for the outcome).
>
> Tasks are not supposed to be coming out of KILLED (is this possible for
> other states too?). Perhaps the comment should clarify all exact known cases
> where this is possible?
>
> Perhaps we should also be logging any actual removals as warnings in the
> log to highlight that it happened?
> Tasks are not supposed to be coming out of KILLED (is this possible for other
> states too?). Perhaps the comment should clarify all exact known cases where
> this is possible?

Should we be asserting that the task is in an expected state?


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71641/#review218422
-----------------------------------------------------------


On Oct. 28, 2019, 5:53 p.m., Benjamin Bannier wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71641/
> -----------------------------------------------------------
>
> (Updated Oct. 28, 2019, 5:53 p.m.)
>
>
> Review request for mesos, Benno Evers, Benjamin Mahler, and Greg Mann.
>
>
> Bugs: MESOS-10018
> https://issues.apache.org/jira/browse/MESOS-10018
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Under certain conditions tasks which were previously `TASK_LOST` and
> completed can reappear in non-terminal states, e.g., if the agent on
> which they were running reconnects.
>
> This patch adds garbage collection of such completed tasks so that users
> do not see tasks twice when obtaining task information from the master
> API. This change does not affect task status updates where we already
> correctly reported a previously `TASK_LOST` state as superseded by e.g.,
> `TASK_RUNNING`.
>
>
> Diffs
> -----
>
> src/master/master.cpp 351823e69f14dbb5eb1ea2b108c42e93722f1eff
> src/tests/master_tests.cpp 5486e23ce146eda9191e081a48c1f3fcb52a7569
>
>
> Diff: https://reviews.apache.org/r/71641/diff/3/
>
>
> Testing
> -------
>
> `make check`
>
>
> Thanks,
>
> Benjamin Bannier
>
>
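For illustration only, the mitigation discussed in this thread (dropping a completed entry when the same task reappears in a non-terminal state, logging the removal as a warning, and asserting that the dropped entry was in an expected state) might look roughly like the sketch below. This is not the code from the patch: `Task`, `TaskState`, `isTerminal`, and `removeSupersededCompleted` are hypothetical, simplified stand-ins and do not correspond to the actual types or functions in `src/master/master.cpp`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for illustration only; these are not
// the Mesos `Task`/`TaskState` types.
enum class TaskState { RUNNING, KILLED, LOST };

struct Task
{
  std::string id;
  TaskState state;
};

inline bool isTerminal(TaskState state)
{
  return state == TaskState::KILLED || state == TaskState::LOST;
}

// Drops every entry in `completed` whose ID matches `reappeared`, provided
// `reappeared` is in a non-terminal state. Returns the number of entries
// removed.
inline std::size_t removeSupersededCompleted(
    std::vector<Task>& completed,
    const Task& reappeared)
{
  if (isTerminal(reappeared.state)) {
    return 0; // A terminal task does not supersede a completed entry.
  }

  auto superseded = [&](const Task& task) {
    if (task.id != reappeared.id) {
      return false;
    }

    // Assumption: completed entries are expected to be terminal.
    assert(isTerminal(task.state));
    return true;
  };

  auto end = std::remove_if(completed.begin(), completed.end(), superseded);
  const std::size_t removed =
    static_cast<std::size_t>(std::distance(end, completed.end()));
  completed.erase(end, completed.end());

  if (removed > 0) {
    // Surface the inconsistency instead of silently cleaning it up.
    std::cerr << "WARNING: removed " << removed
              << " completed entries for task '" << reappeared.id
              << "' which reappeared in a non-terminal state" << std::endl;
  }

  return removed;
}
```

Whether an assertion or only a warning is appropriate depends on which states a task can actually reappear from, which is the open question raised above.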
