[jira] [Commented] (MESOS-5396) After failover, master does not remove agents with same UPID.

Yan Xu (JIRA) Fri, 23 Jun 2017 15:45:56 -0700

    [ https://issues.apache.org/jira/browse/MESOS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061604#comment-16061604 ]


Yan Xu commented on MESOS-5396:
-------------------------------

As noted in MESOS-6223, even after MESOS-6223 is merged, an agent can still 
recover as a new agent after a reboot if its agent info changed during the 
reboot.

Therefore, although MESOS-6223 may affect this ticket's priority, the 
condition described here is still valid. It also covers cases where the agent 
restarts with a new ID without a host reboot (the latest symlink was removed).

> After failover, master does not remove agents with same UPID.
> -------------------------------------------------------------
>
>                 Key: MESOS-5396
>                 URL: https://issues.apache.org/jira/browse/MESOS-5396
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Neil Conway
>            Assignee: Neil Conway
>            Priority: Critical
>              Labels: mesosphere
>
> Scenario:
> * master fails over
> * an agent host is restarted; the agent attempts to *register* (not 
> reregister) with Mesos using the same UPID as the previous agent instance; 
> this means it will get a new agent ID
> * framework isn't notified about the status of the tasks on the *old* agentID 
> until the {{agent_reregister_timeout}} expires (10 mins)
>
> This isn't necessarily wrong but it is suboptimal: when the agent attempts to 
> register with the same UPID that was used by the previous agent instance, we 
> know that a *reregistration* attempt for the old <UPID, agentID> pair will 
> never be seen. Hence we can declare the old agentID to be gone-forever and 
> notify frameworks appropriately, without waiting for the full 
> {{agent_reregister_timeout}} to expire.
>
> Note that we already implement the proposed behavior for the case when the 
> master does *not* fail over 
> (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
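
The proposed behavior can be illustrated with a small toy model. This is only a sketch of the idea in the issue description, not Mesos code: the ToyMaster class, its fields, and the generated agent IDs are all hypothetical, and the real master tracks far more state. It assumes that after failover the master knows the UPID last used by each recovered agent.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical, simplified stand-in for master state after failover."""

    # Agents recovered from the registry that have not reregistered yet:
    # agent ID -> UPID the agent last used (assumed to be known here).
    recovered: dict = field(default_factory=dict)
    # Agents currently registered: agent ID -> UPID.
    registered: dict = field(default_factory=dict)
    # Agent IDs declared gone; frameworks would be notified about these.
    removed: list = field(default_factory=list)
    _next_id: int = 0

    def register(self, upid: str) -> str:
        """Handle a *register* (not reregister) request coming from `upid`."""
        # A recovered agent that used this UPID can never reregister, since a
        # new agent instance now owns the UPID. Remove it right away instead
        # of waiting for the reregistration timeout.
        stale = [aid for aid, old in self.recovered.items() if old == upid]
        for aid in stale:
            del self.recovered[aid]
            self.removed.append(aid)

        # The registering agent gets a fresh (hypothetical) agent ID.
        self._next_id += 1
        agent_id = f"agent-{self._next_id}"
        self.registered[agent_id] = upid
        return agent_id


master = ToyMaster(recovered={
    "old-1": "slave(1)@10.0.0.5:5051",
    "old-2": "slave(1)@10.0.0.6:5051",
})
new_id = master.register("slave(1)@10.0.0.5:5051")
print(new_id, master.removed)  # agent-1 ['old-1']
```

Here only old-1 is removed immediately; old-2 uses a different UPID, so it stays in the recovered set and would still be subject to the usual timeout.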



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
