Re: Review Request 71641: Garbage-collected lost tasks which are reported as running again.

Greg Mann Tue, 29 Oct 2019 16:36:19 -0700


> On Oct. 28, 2019, 6:07 p.m., Benjamin Mahler wrote:
> > src/master/master.cpp
> > Lines 7848 (patched)
> > <https://reviews.apache.org/r/71641/diff/2/?file=2170613#file2170613line7848>
> >
> >     Hm.. don't we enforce agent removal by not allowing the agent to 
> > re-register?
> >     
> >     In the framework removal case, I guess we're not enforcing it?
> >     
> >     Having the task transition out of terminal seems a bit strange for 
> > those two cases (are there other cases?)
> 
> Benjamin Bannier wrote:
>     One scenario where this can happen is maintenance where an agent goes 
> `down` and then `up` again after agent failover. The master will transition 
> the tasks without waiting for task status updates from the agent. This patch 
> adds a test for that (which fails without the patch).
>     
>     I could imagine scenarios involving framework teardown, agent failover, 
> and framework registration using the old `FrameworkID` as well when the 
> master has already forgotten the ID.
>     
>     This patch merely introduces a mitigation for possible inconsistencies 
> due to the design; we should fix the design as well, see e.g., MESOS-9940, 
> which addresses one framework teardown edge case.
> 
> Benjamin Mahler wrote:
>     Ok, perhaps the patch and comment can be re-framed? "Garbage-collect" 
> sounds like cleaning up old unneeded data, but this is a mitigation papering 
> over possible inconsistency that can arise due to flawed design (i.e. lack of 
> enforcement of actions that the master is taking, or in the case of 
> MESOS-9940 probably the master should defer to the agent for the outcome).
>     
>     Tasks are not supposed to be coming out of KILLED (is this possible for 
> other states too?). Perhaps the comment should clarify all exact known cases 
> where this is possible?
>     
>     Perhaps we should also be logging any actual removals as warnings in the 
> log to highlight that it happened?


> Tasks are not supposed to be coming out of KILLED (is this possible for other 
> states too?). Perhaps the comment should clarify all exact known cases where 
> this is possible?

Should we be asserting that the task is in an expected state?
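
To make the question concrete, here is a minimal, self-contained sketch of the two options (a hard assertion versus a logged warning). It is not the code from the patch: `TaskState`, `CompletedTasks`, `isKnownResurrectableState`, and `removeResurrected` are hypothetical stand-ins, and the set of "known" states is only what has come up in this thread (`TASK_LOST` and `KILLED`).

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the Mesos types.
enum class TaskState { RUNNING, KILLED, LOST, FINISHED };

// Assumption: the terminal states that this thread mentions a task
// coming back from. The real list of known cases may differ.
inline bool isKnownResurrectableState(TaskState state) {
  return state == TaskState::LOST || state == TaskState::KILLED;
}

struct CompletedTasks {
  // Task ID -> terminal state recorded when the task was marked completed.
  std::map<std::string, TaskState> tasks;

  // Called when `taskId` is reported in a non-terminal state again.
  // Removes the stale completed entry so the task is not listed twice.
  // Returns true if an entry was removed.
  //
  // With `strict` set, an unexpected recorded state trips an assertion;
  // otherwise it is only logged, and every removal is logged as a warning.
  bool removeResurrected(const std::string& taskId, bool strict = false) {
    auto it = tasks.find(taskId);
    if (it == tasks.end()) {
      return false; // Nothing recorded as completed; nothing to clean up.
    }

    const bool expected = isKnownResurrectableState(it->second);
    if (strict) {
      assert(expected && "task left an unexpected terminal state");
    } else if (!expected) {
      std::cerr << "WARNING: task '" << taskId
                << "' left a terminal state outside the known cases\n";
    }

    std::cerr << "WARNING: removing completed task '" << taskId
              << "' because it was reported as non-terminal again\n";
    tasks.erase(it);
    return true;
  }
};
```

The non-strict path keeps the master running when an unknown case shows up and leaves a trace in the log, while the strict path would surface such cases immediately at the cost of aborting.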


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71641/#review218422
-----------------------------------------------------------


On Oct. 28, 2019, 5:53 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71641/
> -----------------------------------------------------------
> 
> (Updated Oct. 28, 2019, 5:53 p.m.)
> 
> 
> Review request for mesos, Benno Evers, Benjamin Mahler, and Greg Mann.
> 
> 
> Bugs: MESOS-10018
>     https://issues.apache.org/jira/browse/MESOS-10018
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Under certain conditions tasks which were previously `TASK_LOST` and
> completed can reappear in non-terminal states, e.g., if the agent on
> which they were running reconnects.
> 
> This patch adds garbage collection of such completed tasks so that users
> do not see tasks twice when obtaining task information from the master
> API. This change does not affect task status updates where we already
> correctly reported a previously `TASK_LOST` state as superseded by e.g.,
> `TASK_RUNNING`.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 351823e69f14dbb5eb1ea2b108c42e93722f1eff 
>   src/tests/master_tests.cpp 5486e23ce146eda9191e081a48c1f3fcb52a7569 
> 
> 
> Diff: https://reviews.apache.org/r/71641/diff/3/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>
