[ 
https://issues.apache.org/jira/browse/MESOS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5396:
-----------------------------------
    Priority: Critical  (was: Major)

Bumping the priority. Note that this situation can extend far beyond the 
{{--\[agent|slave]_reregister_timeout}} because agent removals are rate 
limited. In one case, a cluster suffered a large-scale power loss and a large 
number of agents had to be removed; with the rate limit applied, it would have 
taken O(days) to remove all of them. If the framework does not provide the 
SlaveID during explicit reconciliation, no progress can be made until all of 
the removals complete.
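
As a rough illustration of how rate limiting stretches removal into days, here is a minimal sketch. The agent count and the rate limit are hypothetical numbers chosen for the example, not the values from the incident, and `removal_drain_time` is an illustrative helper, not a Mesos function. It models the limit as a fixed number of removals per interval.

```python
from datetime import timedelta


def removal_drain_time(num_agents, removals_per_interval, interval):
    """Time until the last of num_agents is removed under a simple rate limit.

    Assumes (for illustration only) that exactly `removals_per_interval`
    agents are removed every `interval`, and that the first batch also
    waits one full interval.
    """
    # Ceiling division: a partially filled final batch still costs an interval.
    intervals = -(-num_agents // removals_per_interval)
    return interval * intervals


# Hypothetical: 1000 agents to remove, limit of 1 removal per 10 minutes.
drain = removal_drain_time(1000, 1, timedelta(minutes=10))
print(drain)  # 6 days, 22:40:00
```

Under these assumed numbers, a framework that reconciles without a SlaveID would wait almost a week, compared with the 10 minute reregister timeout.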

If the patch is straightforward enough, backports would be great.

[~neilc], should this still be assigned to you?

> After failover, master does not remove agents with same UPID
> ------------------------------------------------------------
>
>                 Key: MESOS-5396
>                 URL: https://issues.apache.org/jira/browse/MESOS-5396
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Neil Conway
>            Assignee: Neil Conway
>            Priority: Critical
>              Labels: mesosphere
>
> Scenario:
> * master fails over
> * an agent host is restarted; the agent attempts to register with Mesos using 
> the same UPID as the previous agent instance; this means it will get a new 
> agent ID
> * framework isn't notified about the status of the tasks on the *old* slaveID 
> until the slave_reregister_timeout expires (10 mins)
> This isn't necessarily wrong, but it is suboptimal: when the slave attempts 
> to register with the same UPID that was used by the previous slave instance, 
> we know that a *reregistration* attempt for the old <UPID, slaveID> pair will 
> never be seen. Hence we can declare the old slaveID to be gone-forever and 
> notify frameworks appropriately, without waiting for the full 
> slave_reregister_timeout to expire.
> Note that we already implement the proposed behavior for the case when the 
> master does *not* fail over 
> (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
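
The behavior proposed in the description can be sketched as a toy model. This is not the Mesos implementation: the `Master` class, its fields, and the ID format are all hypothetical, and the sketch assumes the failed-over master can look up a recovered agent by UPID, which the description implies but does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    # Agents known from before the failover that have not reregistered yet,
    # keyed by UPID (hypothetical representation): upid -> old agent ID.
    recovered: dict = field(default_factory=dict)
    # Currently registered agents: upid -> agent ID.
    registered: dict = field(default_factory=dict)
    # Old agent IDs declared gone; frameworks would be notified about these.
    lost: list = field(default_factory=list)
    next_id: int = 0

    def register_agent(self, upid):
        """Handle a fresh registration (not a reregistration) from `upid`."""
        old_id = self.recovered.pop(upid, None)
        if old_id is not None:
            # A new registration from the same UPID means the old
            # <UPID, agent ID> pair will never reregister, so there is no
            # need to wait for the reregister timeout.
            self.lost.append(old_id)
        self.next_id += 1
        new_id = f"S{self.next_id}"
        self.registered[upid] = new_id
        return new_id


upid = "slave(1)@10.0.0.5:5051"
master = Master(recovered={upid: "S-old"})
new_id = master.register_agent(upid)
print(new_id, master.lost)  # S1 ['S-old']
```

The point of the sketch is only the ordering: the old agent ID is declared gone at registration time, rather than when the timeout fires.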



