[ https://issues.apache.org/jira/browse/MESOS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neil Conway updated MESOS-5396:
-------------------------------
Description:
Scenario:
* master fails over
* an agent host is restarted; the agent attempts to *register* (not reregister)
with Mesos using the same UPID as the previous agent instance; this means it
will get a new agent ID
* framework isn't notified about the status of the tasks on the *old* agentID
until the {{agent_reregister_timeout}} expires (10 mins)
This isn't necessarily wrong but it is suboptimal: when the agent attempts to
register with the same UPID that was used by the previous agent instance, we
know that a *reregistration* attempt for the old <UPID, agentID> pair will
never be seen. Hence we can declare the old agentID to be gone-forever and
notify frameworks appropriately, without waiting for the full
{{agent_reregister_timeout}} to expire.
Note that we already implement the proposed behavior for the case when the
master does *not* fail over
(https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
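The proposed behavior can be sketched roughly as follows. This is a toy model, not the Mesos implementation: {{ToyMaster}}, {{recover}}, {{registerAgent}}, {{lostAgents}}, and the {{agent-N}} ID format are all hypothetical names invented for illustration, and "notifying frameworks" is reduced to appending to a list.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for the master's agent bookkeeping.
// None of these names are taken from the Mesos source.
class ToyMaster {
public:
  // Record an agent that was recovered after a master failover but has
  // not yet reregistered (its reregistration timeout is still running).
  void recover(const std::string& upid, const std::string& agentId) {
    recovered_[upid] = agentId;
  }

  // Handle a *register* (not reregister) request from `upid`.
  // Returns the newly assigned agent ID.
  std::string registerAgent(const std::string& upid) {
    // Failover case (the one this issue proposes): a fresh registration
    // from a recovered UPID means the old <UPID, agentID> pair will never
    // reregister, so declare the old ID gone without waiting.
    auto rec = recovered_.find(upid);
    if (rec != recovered_.end()) {
      lost_.push_back(rec->second);  // stands in for notifying frameworks
      recovered_.erase(rec);
    }

    // No-failover case (described in the issue as already implemented):
    // a currently registered agent with the same UPID is replaced.
    auto reg = registered_.find(upid);
    if (reg != registered_.end()) {
      lost_.push_back(reg->second);
      registered_.erase(reg);
    }

    std::string id = "agent-" + std::to_string(nextId_++);
    registered_[upid] = id;
    return id;
  }

  // Old agent IDs that have been declared gone, in order of removal.
  const std::vector<std::string>& lostAgents() const { return lost_; }

  // Number of recovered agents still waiting to reregister.
  std::size_t pendingReregistrations() const { return recovered_.size(); }

private:
  std::unordered_map<std::string, std::string> recovered_;
  std::unordered_map<std::string, std::string> registered_;
  std::vector<std::string> lost_;
  int nextId_ = 0;
};
```

In this sketch, calling {{recover}} for a UPID and then {{registerAgent}} with the same UPID moves the old agent ID into {{lostAgents()}} immediately, which corresponds to notifying frameworks before {{agent_reregister_timeout}} expires.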
was:
Scenario:
* master fails over
* an agent host is restarted; the agent attempts to *register* (not reregister)
with Mesos using the same UPID as the previous agent instance; this means it
will get a new agent ID
* framework isn't notified about the status of the tasks on the *old* slaveID
until the {{agent_reregister_timeout}} expires (10 mins)
This isn't necessarily wrong but it is suboptimal: when the agent attempts to
register with the same UPID that was used by the previous agent instance, we
know that a *reregistration* attempt for the old <UPID, agentID> pair will
never be seen. Hence we can declare the old agentID to be gone-forever and
notify frameworks appropriately, without waiting for the full
{{agent_reregister_timeout}} to expire.
Note that we already implement the proposed behavior for the case when the
master does *not* failover
(https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
> After failover, master does not remove agents with same UPID
> ------------------------------------------------------------
>
> Key: MESOS-5396
> URL: https://issues.apache.org/jira/browse/MESOS-5396
> Project: Mesos
> Issue Type: Bug
> Components: master
> Reporter: Neil Conway
> Assignee: Neil Conway
> Priority: Critical
> Labels: mesosphere