[ https://issues.apache.org/jira/browse/MESOS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neil Conway updated MESOS-5396:
-------------------------------
    Description:
Scenario:

* master fails over
* an agent host is restarted; the agent attempts to *register* (not reregister) with Mesos using the same UPID as the previous agent instance; this means it will get a new agent ID
* framework isn't notified about the status of the tasks on the *old* agentID until the {{agent_reregister_timeout}} expires (10 mins)

This isn't necessarily wrong but it is suboptimal: when the agent attempts to register with the same UPID that was used by the previous agent instance, we know that a *reregistration* attempt for the old <UPID, agentID> pair will never be seen. Hence we can declare the old agentID to be gone-forever and notify frameworks appropriately, without waiting for the full {{agent_reregister_timeout}} to expire.

Note that we already implement the proposed behavior for the case when the master does *not* failover (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).

  was:
Scenario:

* master fails over
* an agent host is restarted; the agent attempts to *register* (not reregister) with Mesos using the same UPID as the previous agent instance; this means it will get a new agent ID
* framework isn't notified about the status of the tasks on the *old* slaveID until the {{agent_reregister_timeout}} expires (10 mins)

This isn't necessarily wrong but it is suboptimal: when the agent attempts to register with the same UPID that was used by the previous agent instance, we know that a *reregistration* attempt for the old <UPID, agentID> pair will never be seen. Hence we can declare the old agentID to be gone-forever and notify frameworks appropriately, without waiting for the full {{agent_reregister_timeout}} to expire.

Note that we already implement the proposed behavior for the case when the master does *not* failover (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
> After failover, master does not remove agents with same UPID
> ------------------------------------------------------------
>
>                 Key: MESOS-5396
>                 URL: https://issues.apache.org/jira/browse/MESOS-5396
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Neil Conway
>            Assignee: Neil Conway
>            Priority: Critical
>              Labels: mesosphere
>
> Scenario:
> * master fails over
> * an agent host is restarted; the agent attempts to *register* (not
> reregister) with Mesos using the same UPID as the previous agent instance;
> this means it will get a new agent ID
> * framework isn't notified about the status of the tasks on the *old* agentID
> until the {{agent_reregister_timeout}} expires (10 mins)
> This isn't necessarily wrong but it is suboptimal: when the agent attempts to
> register with the same UPID that was used by the previous agent instance, we
> know that a *reregistration* attempt for the old <UPID, agentID> pair will
> never be seen. Hence we can declare the old agentID to be gone-forever and
> notify frameworks appropriately, without waiting for the full
> {{agent_reregister_timeout}} to expire.
> Note that we already implement the proposed behavior for the case when the
> master does *not* failover
> (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
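
For illustration only, here is a minimal sketch of the behavior proposed in the description. It is not Mesos code: MasterSketch, addRecovered, registerAgent and the string-based UPID and AgentID aliases are hypothetical names, and the sketch assumes the master can look up a recovered agent's old ID by UPID after failover, which the description does not confirm. A fresh registration from a known UPID marks the old agentID as removed right away, whether that agent was recovered after failover or is currently registered.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Mesos types.
using UPID = std::string;     // e.g. "slave(1)@10.0.0.5:5051"
using AgentID = std::string;

class MasterSketch {
public:
  // Records an agent known from before the failover that has not yet
  // reregistered. ASSUMPTION: the master can map the UPID to the old ID.
  void addRecovered(const UPID& pid, const AgentID& id) {
    recovered_[pid] = id;
  }

  // Handles a *register* (not reregister) request and returns the new ID.
  AgentID registerAgent(const UPID& pid) {
    // Failover case: the old <UPID, agentID> pair can never reregister,
    // so remove it now instead of waiting for the reregister timeout.
    auto recoveredIt = recovered_.find(pid);
    if (recoveredIt != recovered_.end()) {
      removed_.push_back(recoveredIt->second);
      recovered_.erase(recoveredIt);
    }

    // Non-failover case: an agent with this UPID is currently registered.
    auto registeredIt = registered_.find(pid);
    if (registeredIt != registered_.end()) {
      removed_.push_back(registeredIt->second);
      registered_.erase(registeredIt);
    }

    // Postcondition: nothing is still awaited for this UPID.
    assert(recovered_.count(pid) == 0);

    AgentID id = "new-agent-" + std::to_string(nextId_++);
    registered_[pid] = id;
    return id;
  }

  // Agent IDs declared gone; frameworks would be notified about these.
  const std::vector<AgentID>& removed() const { return removed_; }

  // Number of recovered agents the master is still waiting on.
  std::size_t awaitingReregistration() const { return recovered_.size(); }

private:
  std::unordered_map<UPID, AgentID> recovered_;
  std::unordered_map<UPID, AgentID> registered_;
  std::vector<AgentID> removed_;
  int nextId_ = 0;
};
```

In this sketch the removed() list stands in for the framework notification; a real master would also have to update its registry and report the affected tasks, which is omitted here.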