[
https://issues.apache.org/jira/browse/MESOS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Rukletsov updated MESOS-5396:
---------------------------------------
Summary: After failover, master does not remove agents with same UPID.
(was: After failover, master does not remove agents with same UPID)
> After failover, master does not remove agents with same UPID.
> -------------------------------------------------------------
>
> Key: MESOS-5396
> URL: https://issues.apache.org/jira/browse/MESOS-5396
> Project: Mesos
> Issue Type: Bug
> Components: master
> Reporter: Neil Conway
> Assignee: Neil Conway
> Priority: Critical
> Labels: mesosphere
>
> Scenario:
> * master fails over
> * an agent host is restarted; the agent attempts to *register* (not
> reregister) with Mesos using the same UPID as the previous agent instance;
> this means it will get a new agent ID
> * frameworks aren't notified about the status of the tasks on the *old* agentID
> until the {{agent_reregister_timeout}} expires (10 minutes)
> This isn't necessarily wrong, but it is suboptimal: when the agent attempts to
> register with the same UPID that was used by the previous agent instance, we
> know that a *reregistration* attempt for the old <UPID, agentID> pair will
> never be seen. Hence we can declare the old agentID to be gone forever and
> notify frameworks appropriately, without waiting for the full
> {{agent_reregister_timeout}} to expire.
> Note that we already implement the proposed behavior for the case when the
> master does *not* fail over
> (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
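
The proposed behavior can be illustrated with a small toy model. This is only a sketch of the bookkeeping described above, not Mesos code: the `Master` class, its `recovered` and `registered` maps, the `lost_notifications` list, and the generated agent IDs are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Master:
    """Toy model of post-failover agent bookkeeping (hypothetical, not Mesos code)."""

    # Agents recovered after failover that have not yet reregistered:
    # UPID -> old agent ID.
    recovered: Dict[str, str] = field(default_factory=dict)
    # Agents currently registered: UPID -> agent ID.
    registered: Dict[str, str] = field(default_factory=dict)
    # Agent IDs that frameworks have been told are gone.
    lost_notifications: List[str] = field(default_factory=list)
    next_id: int = 0

    def register(self, upid: str) -> str:
        """Handle a fresh registration (not a reregistration) from `upid`."""
        old_id = self.recovered.pop(upid, None)
        if old_id is not None:
            # A new agent is registering with the same UPID, so the old
            # <UPID, agentID> pair can never reregister. Report it as gone
            # now instead of waiting for agent_reregister_timeout.
            self.lost_notifications.append(old_id)
        self.next_id += 1
        new_id = f"agent-{self.next_id}"
        self.registered[upid] = new_id
        return new_id

    def reregister_timeout_expired(self) -> None:
        """Fallback path: every agent that never reregistered is declared gone."""
        self.lost_notifications.extend(self.recovered.values())
        self.recovered.clear()
```

In this model, an agent that registers with a recovered UPID causes the old agent ID to be reported immediately, while recovered agents whose UPID is never seen again are still only reported once the timeout fires.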
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)