[ https://issues.apache.org/jira/browse/MESOS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil Conway updated MESOS-5396:
-------------------------------
    Description: 
Scenario:

* master fails over
* an agent host is restarted; the agent attempts to *register* (not reregister) 
with Mesos using the same UPID as the previous agent instance; this means it 
will get a new agent ID
* the framework isn't notified about the status of the tasks on the *old* 
agent ID until the {{agent_reregister_timeout}} expires (10 minutes by default)

This isn't necessarily wrong but it is suboptimal: when the agent attempts to 
register with the same UPID that was used by the previous agent instance, we 
know that a *reregistration* attempt for the old <UPID, agentID> pair will 
never be seen. Hence we can declare the old agentID to be gone-forever and 
notify frameworks appropriately, without waiting for the full 
{{agent_reregister_timeout}} to expire.
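
To make the proposal concrete, here is a minimal Python sketch of the bookkeeping involved. It is not the actual master code (which is C++). The {{ToyMaster}} class, its fields and method names, and the assumption that the master can recover each agent's UPID after failover are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Toy model of a master that has just failed over (not Mesos code)."""

    # Agents recovered after failover that have not reregistered yet,
    # keyed by agent ID. Assumption: the UPID is known for each of them.
    recovered: dict = field(default_factory=dict)
    # Agents currently registered, keyed by agent ID.
    registered: dict = field(default_factory=dict)
    # Agent IDs that frameworks have been told are gone, in order.
    lost_notifications: list = field(default_factory=list)
    next_id: int = 0

    def register(self, upid):
        """Handle a fresh registration (not a reregistration) from upid."""
        # A recovered agent with this UPID can never reregister: the
        # process at that address has restarted without its old agent ID.
        stale = [aid for aid, u in self.recovered.items() if u == upid]
        for aid in stale:
            del self.recovered[aid]
            self.lost_notifications.append(aid)
        # The registering agent gets a brand-new agent ID.
        self.next_id += 1
        new_id = f"agent-{self.next_id}"
        self.registered[new_id] = upid
        return new_id

    def reregister_timeout_expired(self):
        """Fallback path: every agent still unreregistered is declared gone."""
        for aid in list(self.recovered):
            del self.recovered[aid]
            self.lost_notifications.append(aid)


master = ToyMaster(recovered={"old-agent": "slave(1)@10.0.0.5:5051"})
master.register("slave(1)@10.0.0.5:5051")
print(master.lost_notifications)  # ['old-agent']
```

In this sketch the old agent ID is reported as gone at registration time, so the timeout path only has to cover agents whose UPID never shows up again.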

Note that we already implement the proposed behavior for the case when the 
master does *not* fail over 
(https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).

  was:
Scenario:

* master fails over
* an agent host is restarted; the agent attempts to register with Mesos using 
the same UPID as the previous agent instance; this means it will get a new 
agent ID
* framework isn't notified about the status of the tasks on the *old* slaveID 
until the slave_reregister_timeout expires (10 mins)

This isn't necessarily wrong, but it is suboptimal: when the slave attempts to 
register with the same UPID that was used by the previous slave instance, we 
know that a *reregistration* attempt for the old <UPID, slaveID> pair will 
never be seen. Hence we can declare the old slaveID to be gone-forever and 
notify frameworks appropriately, without waiting for the full 
slave_reregister_timeout to expire.

Note that we already implement the proposed behavior for the case when the 
master does *not* failover 
(https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).


> After failover, master does not remove agents with same UPID
> ------------------------------------------------------------
>
>                 Key: MESOS-5396
>                 URL: https://issues.apache.org/jira/browse/MESOS-5396
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Neil Conway
>            Assignee: Neil Conway
>            Priority: Critical
>              Labels: mesosphere
>
> Scenario:
>
> * master fails over
> * an agent host is restarted; the agent attempts to *register* (not 
> reregister) with Mesos using the same UPID as the previous agent instance; 
> this means it will get a new agent ID
> * the framework isn't notified about the status of the tasks on the *old* 
> agent ID until the {{agent_reregister_timeout}} expires (10 minutes by default)
>
> This isn't necessarily wrong but it is suboptimal: when the agent attempts to 
> register with the same UPID that was used by the previous agent instance, we 
> know that a *reregistration* attempt for the old <UPID, agentID> pair will 
> never be seen. Hence we can declare the old agentID to be gone-forever and 
> notify frameworks appropriately, without waiting for the full 
> {{agent_reregister_timeout}} to expire.
>
> Note that we already implement the proposed behavior for the case when the 
> master does *not* fail over 
> (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
