Vinod Kone created MESOS-530:
--------------------------------
Summary: A registered slave should check the registration id when it
receives multiple (re-)registered messages from the master
Key: MESOS-530
URL: https://issues.apache.org/jira/browse/MESOS-530
Project: Mesos
Issue Type: Bug
Affects Versions: 0.13.0
Reporter: Vinod Kone
Assignee: Vinod Kone
Fix For: 0.13.0
We have seen this in production at Twitter.
Timeline of events:
06/26 06:46: Slave host rebooted
06/26 06:46.54: Slave registered (201305082239-1864771594-5050-8729-594) with
master. But the slave never got the ACK, presumably because of a network
partition.
06/26 06:47.21: Slave disconnected from the master and the master removed the
slave.
06/26 06:47.21: Immediately after, the master registered the slave with a new
id (201305082239-1864771594-5050-8729-609).
But the slave only received the old id
(201305082239-1864771594-5050-8729-594)!!! So there is a mismatch between the
slave id known to the master and the one known to the slave!
Currently the slave silently ignores a (re-)registration message from the
master if it is already (re-)registered. This was originally designed to ignore
duplicate (re-)registered messages sent by the master. But clearly it doesn't
catch the above edge case.
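
The difference between silently ignoring and checking the id can be sketched
as below. This is only a toy model in Python, not the Mesos slave code: the
Slave class, on_registered, check_id and replay are hypothetical names, and it
assumes the slave eventually sees both registered messages, with the delayed
one for the old id arriving first. What the slave should do once it detects a
mismatch is left open here; the sketch only records it.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Ids taken from the timeline above.
OLD_ID = "201305082239-1864771594-5050-8729-594"
NEW_ID = "201305082239-1864771594-5050-8729-609"


@dataclass
class Slave:
    """Toy model of how a slave handles 'registered' messages (hypothetical)."""
    check_id: bool
    slave_id: Optional[str] = None
    mismatches: List[str] = field(default_factory=list)

    def on_registered(self, new_id: str) -> None:
        if self.slave_id is None:
            # First ACK seen: adopt the id the master sent.
            self.slave_id = new_id
        elif self.check_id and new_id != self.slave_id:
            # Already registered, but the master now knows us by another id.
            self.mismatches.append(new_id)
        # Otherwise the message is a duplicate (or ids are not checked)
        # and it is silently ignored.


def replay(check_id: bool) -> Slave:
    """Replay the two messages from the timeline, delayed one first (assumed order)."""
    slave = Slave(check_id=check_id)
    slave.on_registered(OLD_ID)  # delayed ACK for the first registration
    slave.on_registered(NEW_ID)  # ACK for the registration the master kept
    return slave
```

With check_id=False the second message is dropped and the slave keeps the old
id without noticing anything. With check_id=True the same replay records the
new id as a mismatch, while a true duplicate (same id twice) is still ignored.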