[
https://issues.apache.org/jira/browse/MESOS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061604#comment-16061604
]
Yan Xu commented on MESOS-5396:
-------------------------------
As noted in MESOS-6223, even once that change is merged, the agent can still
recover as a new agent after a reboot if its agent info changed during the
reboot.
So although MESOS-6223 may lower this ticket's priority, the condition
described here is still valid. It also covers cases where the agent restarts
with a new ID without a host reboot (the latest symlink was removed).
> After failover, master does not remove agents with same UPID.
> -------------------------------------------------------------
>
> Key: MESOS-5396
> URL: https://issues.apache.org/jira/browse/MESOS-5396
> Project: Mesos
> Issue Type: Bug
> Components: master
> Reporter: Neil Conway
> Assignee: Neil Conway
> Priority: Critical
> Labels: mesosphere
>
> Scenario:
> * master fails over
> * an agent host is restarted; the agent attempts to *register* (not
> reregister) with Mesos using the same UPID as the previous agent instance;
> this means it will get a new agent ID
> * frameworks aren't notified about the status of the tasks on the *old*
> agentID until the {{agent_reregister_timeout}} expires (10 mins)
> This isn't necessarily wrong but it is suboptimal: when the agent attempts to
> register with the same UPID that was used by the previous agent instance, we
> know that a *reregistration* attempt for the old <UPID, agentID> pair will
> never be seen. Hence we can declare the old agentID to be gone-forever and
> notify frameworks appropriately, without waiting for the full
> {{agent_reregister_timeout}} to expire.
> Note that we already implement the proposed behavior for the case when the
> master does *not* fail over
> (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
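The proposed behavior can be sketched roughly as follows. This is a simplified illustration, not the code in master.cpp: `MasterSketch`, `registerAgent`, and the three containers are hypothetical names, and the sketch assumes that after failover the master can associate each recovered agent ID with the UPID of the agent that owned it.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the master's state after failover.
struct MasterSketch {
  // Agents recovered from the registry that have not reregistered yet,
  // keyed by UPID (assumption: the master can derive this mapping).
  std::map<std::string, std::string> recovered;

  // Agents that are currently registered, keyed by UPID.
  std::map<std::string, std::string> registered;

  // Old agent IDs declared gone; frameworks would be notified about these.
  std::vector<std::string> removed;

  // Handles a fresh registration (not a reregistration) from `upid`.
  void registerAgent(const std::string& upid, const std::string& newAgentId) {
    assert(!newAgentId.empty());

    // A new registration from a UPID that belongs to a recovered agent means
    // the old <UPID, agentID> pair will never reregister, so remove the old
    // agent ID now instead of waiting for agent_reregister_timeout.
    auto it = recovered.find(upid);
    if (it != recovered.end() && it->second != newAgentId) {
      removed.push_back(it->second);
      recovered.erase(it);
    }

    registered[upid] = newAgentId;
  }
};
```

In the real master, removing the old agent would also involve updating the registry and sending task status updates to frameworks; the sketch only records which agent ID would be removed.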