AndyPang created MESOS-7847:
-------------------------------
Summary: After master failover, the 'slaves.recovered' struct contains an
invalid slaveID when the slave reregisters.
Key: MESOS-7847
URL: https://issues.apache.org/jira/browse/MESOS-7847
Project: Mesos
Issue Type: Bug
Components: master
Affects Versions: 1.0.0
Environment: OS: Ubuntu 14.04, Mesos version 1.0.0
2 Mesos masters + 1 Mesos agent
Reporter: AndyPang
We run two mesos-masters and one mesos-slave in order to test Mesos HA. The
steps to reproduce are as follows:
1. Start master1 (leader), master2 (follower), and the agent; the agent
registers successfully with master1.
2. Shut down the agent and, after 75s (the master-agent ping-pong timeout is
15s * 5), shut down master1. Because the timeout has elapsed, master1 has
removed the agent from the 'registry' and sent a 'ShutdownMessage' to it, but
the agent process has already terminated, so it cannot receive this message.
3. master1 is terminated and master2 is now the leader; it recovers from the
registry with 0 agents. Meanwhile the agent is restarted, and its
reregistration with master2 succeeds.
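
The steps above can be modeled with a small toy simulation. This is a
simplified sketch, not Mesos code: `Registry`, `Master`, `mark_unreachable`,
and the other names are hypothetical, and the only figure taken from the
report is the 15s * 5 timeout.

```python
from dataclasses import dataclass, field

PING_INTERVAL_SECS = 15  # ping-pong interval quoted in step 2
MAX_MISSED_PINGS = 5     # multiplier quoted in step 2
AGENT_TIMEOUT_SECS = PING_INTERVAL_SECS * MAX_MISSED_PINGS  # 75


@dataclass
class Registry:
    """Hypothetical stand-in for the registry shared by both masters."""
    agents: set = field(default_factory=set)


@dataclass
class Master:
    """Toy master, not the real Mesos implementation."""
    registry: Registry
    recovered: set = field(default_factory=set)   # models 'slaves.recovered'
    registered: set = field(default_factory=set)

    def register(self, agent_id):
        self.registry.agents.add(agent_id)
        self.registered.add(agent_id)

    def mark_unreachable(self, agent_id, silent_secs):
        # Once the timeout has elapsed, drop the agent from the registry.
        if silent_secs >= AGENT_TIMEOUT_SECS:
            self.registry.agents.discard(agent_id)
            self.registered.discard(agent_id)

    def recover(self):
        # A newly elected leader loads the known agents from the registry.
        self.recovered = set(self.registry.agents)

    def reregister(self, agent_id):
        # Behavior reported here: admitted without consulting 'recovered'.
        self.recovered.discard(agent_id)
        self.registered.add(agent_id)
        return True


def reproduce():
    registry = Registry()
    master1 = Master(registry)
    master2 = Master(registry)
    master1.register("agent-1")                                  # step 1
    master1.mark_unreachable("agent-1", AGENT_TIMEOUT_SECS)      # step 2
    master2.recover()                                            # step 3
    recovered_count = len(master2.recovered)
    admitted = master2.reregister("agent-1")                     # step 3
    return recovered_count, admitted


print(reproduce())  # (0, True)
```

The result `(0, True)` mirrors the report: the new leader recovered 0 agents,
yet the reregistration was admitted.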
So the issue is: when the master recovers from the registry with 0 agents, the
'slaves.recovered' struct does not contain this slave, so the master should not
admit the reregistration as successful; perhaps it should send a
'ShutdownMessage' to the agent instead.
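
A minimal sketch of the check suggested above, assuming reregistration is only
admitted for agents that the new leader recovered from the registry. The
function `handle_reregister` and the "ACK" / "SHUTDOWN" return values are
hypothetical illustrations, not the Mesos API.

```python
def handle_reregister(agent_id, recovered, registered):
    """Decide how a failed-over master answers a reregistration request.

    recovered:  set of agent IDs loaded from the registry at failover
                (models 'slaves.recovered').
    registered: set of agent IDs already reregistered with this master.
    """
    if agent_id in registered:
        # Duplicate request from an agent that is already known.
        return "ACK"
    if agent_id in recovered:
        # Agent was in the registry: move it from recovered to registered.
        recovered.discard(agent_id)
        registered.add(agent_id)
        return "ACK"
    # Agent is not in the registry (it was removed before failover), so
    # reject it rather than silently admitting it.
    return "SHUTDOWN"
```

With an empty `recovered` set, as in step 3, `handle_reregister("agent-1",
set(), set())` returns "SHUTDOWN" instead of admitting the agent.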
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)