Neil Conway updated MESOS-5396:
    Shepherd: Vinod Kone

> After failover, master does not remove agents with same UPID
> ------------------------------------------------------------
>                 Key: MESOS-5396
>                 URL: https://issues.apache.org/jira/browse/MESOS-5396
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Neil Conway
>            Assignee: Neil Conway
>            Priority: Critical
>              Labels: mesosphere
> Scenario:
> * master fails over
> * an agent host is restarted; the agent attempts to *register* (not 
> reregister) with Mesos using the same UPID as the previous agent instance; 
> this means it will get a new agent ID
> * frameworks aren't notified about the status of the tasks on the *old* agentID 
> until the {{agent_reregister_timeout}} expires (10 minutes)
> This isn't necessarily wrong but it is suboptimal: when the agent attempts to 
> register with the same UPID that was used by the previous agent instance, we 
> know that a *reregistration* attempt for the old <UPID, agentID> pair will 
> never be seen. Hence we can declare the old agentID to be gone-forever and 
> notify frameworks appropriately, without waiting for the full 
> {{agent_reregister_timeout}} to expire.
> Note that we already implement the proposed behavior for the case when the 
> master does *not* fail over 
> (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
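
The proposed behavior can be sketched roughly as follows. This is a simplified toy model, not the actual master code: `ToyMaster`, `recoveredByUpid`, `registeredByUpid`, `removed`, and `registerAgent` are all hypothetical names invented for illustration, and the real master tracks considerably more state. The sketch assumes that, after failover, the master knows the UPID of each recovered agent that has not yet reregistered.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the master's view of agents.
// None of these names come from the real Mesos code base.
struct ToyMaster {
  // Agents recovered after failover that have not reregistered yet,
  // keyed by UPID; the value is the old agent ID.
  std::map<std::string, std::string> recoveredByUpid;

  // Agents currently registered with this master, keyed by UPID.
  std::map<std::string, std::string> registeredByUpid;

  // Agent IDs declared gone; a real master would notify frameworks here.
  std::vector<std::string> removed;

  int nextId = 0;

  // Handles a *register* (not reregister) request from `upid` and
  // returns the newly assigned agent ID.
  std::string registerAgent(const std::string& upid) {
    assert(!upid.empty());

    // Proposed behavior: a recovered agent with the same UPID can never
    // reregister, so remove it now instead of waiting for the timeout.
    auto recovered = recoveredByUpid.find(upid);
    if (recovered != recoveredByUpid.end()) {
      removed.push_back(recovered->second);
      recoveredByUpid.erase(recovered);
    }

    // Non-failover case described in the issue: an already registered
    // agent with the same UPID is replaced by the new instance.
    auto registered = registeredByUpid.find(upid);
    if (registered != registeredByUpid.end()) {
      removed.push_back(registered->second);
      registeredByUpid.erase(registered);
    }

    std::string id = "agent-" + std::to_string(nextId++);
    registeredByUpid[upid] = id;
    return id;
  }
};
```

In this model, seeding `recoveredByUpid` with an old agent ID and then calling `registerAgent` with the same UPID moves the old ID into `removed` immediately, which corresponds to notifying frameworks without waiting for {{agent_reregister_timeout}}.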

This message was sent by Atlassian JIRA
