Re: Review Request 53239: Changed master to make use of "retired" agent IDs.

Vinod Kone Fri, 04 Nov 2016 17:20:12 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53239/#review155028
-----------------------------------------------------------





src/master/master.cpp (lines 5144 - 5148)
<https://reviews.apache.org/r/53239/#comment224855>

    If the slave thinks an ID needs to be retired, I think it makes sense for 
the master to send a shutdown to that slave and remove it. If we just ignore 
it, how is the situation going to resolve?



src/master/master.cpp (line 5165)
<https://reviews.apache.org/r/53239/#comment224859>

    Why are we marking it unreachable instead of removing it directly?



src/master/master.cpp (line 5170)
<https://reviews.apache.org/r/53239/#comment224857>

    Retirement



src/master/master.cpp (line 5187)
<https://reviews.apache.org/r/53239/#comment224856>

    What's the plan for including old agent IDs when the agent doesn't reboot? 
Also, should we rename it to "old_slave_ids" instead of "retired_slave_ids"?


- Vinod Kone


On Oct. 27, 2016, 10:46 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53239/
> -----------------------------------------------------------
> 
> (Updated Oct. 27, 2016, 10:46 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-5396
>     https://issues.apache.org/jira/browse/MESOS-5396
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> A retired agent ID will never attempt to re-register in the future;
> moreover, any tasks/executors being managed by that agent ID are no
> longer running. We can take advantage of this knowledge to avoid waiting
> for `agent_reregister_timeout` to expire after master failover.
> 
> This is particularly important when agent removal rate-limiting is in
> use: if a power failure causes the master to fail at the same time that
> many agent hosts lose power, when power returns the master will fail over
> and all the agents will register anew and receive new agent IDs. With
> agent removal rate-limiting, it may take a long time for the master to
> mark all the old agent IDs as unreachable; in the meantime, explicit
> reconciliation will not return any results, potentially leaving
> frameworks in limbo for an extended period.
> 
> Note that we currently mark retired agents as unreachable; in the near
> future, that will change to marking such agents "gone", once support for
> that feature is completed.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 87186c6e733a686f96528b1722fda1c287e9c881 
>   src/master/master.cpp 8692726d21812827f9e1fd9093d80fd260588ecb 
>   src/tests/slave_recovery_tests.cpp 65fc18bc2732dc53581d39ee23368e347f0b2ca4 
> 
> Diff: https://reviews.apache.org/r/53239/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> NOTE: Current implementation reuses `Master::markUnreachableAfterFailover`, 
> which means we emit misleading log messages and increment the wrong metrics. 
> Will adjust based on initial review comments.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>
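
The behavior described in the quoted review request can be illustrated with a
toy model. This is a minimal sketch, not the code from the diff: `ToyMaster`,
`registerAgent`, `exampleScenario`, and the three sets are hypothetical names,
and it assumes that a newly registering agent reports the IDs it has retired.
A retired ID that the master recovered after failover is marked unreachable
right away instead of waiting for `agent_reregister_timeout`. Retired IDs the
master does not know about are simply ignored here, which is also an
assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model only: none of these names are taken from the Mesos source.
struct ToyMaster {
  // Agent IDs recovered from the registry after master failover that have
  // not re-registered yet.
  std::set<std::string> recovered;

  // Agent IDs that have been marked unreachable.
  std::set<std::string> unreachable;

  // Agent IDs that are currently registered.
  std::set<std::string> registered;

  // Handles a fresh registration that carries the IDs the agent has retired
  // (assumption: the agent reports them when it registers). Every retired ID
  // that is still in `recovered` is moved to `unreachable` immediately, with
  // no timeout involved. Returns the number of IDs that were moved.
  std::size_t registerAgent(const std::string& newId,
                            const std::vector<std::string>& retiredIds) {
    std::size_t moved = 0;
    for (const std::string& id : retiredIds) {
      if (recovered.erase(id) > 0) {
        unreachable.insert(id);
        ++moved;
      }
      // Unknown retired IDs are ignored in this sketch.
    }
    registered.insert(newId);
    return moved;
  }
};

// Example: the master recovered "S1" and "S2"; a rebooted agent registers
// as "S3" and reports "S1" (known) and "S9" (unknown) as retired.
inline ToyMaster exampleScenario() {
  ToyMaster master;
  master.recovered = {"S1", "S2"};
  master.registerAgent("S3", {"S1", "S9"});
  return master;
}
```

In `exampleScenario`, "S1" becomes unreachable as soon as "S3" registers, while
"S2" stays in the recovered set and would still be subject to the normal
re-registration timeout.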
