[ 
https://issues.apache.org/jira/browse/MESOS-530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-530:
-----------------------------

    Description: 
We have seen this in production at Twitter

Timeline of events:
06/26 06:46: Slave host rebooted
06/26 06:46.54: Slave registered (201305082239-1864771594-5050-8729-594) with 
master. But slave never got the ACK. Presumably there were network partition 
issues.
06/26 06:47.21: Slave disconnected from the master and the master removed the 
slave.
06/26 06:47.21: Immediately after, the master registered the slave with a new 
id (201305082239-1864771594-5050-8729-609).

But the slave only received the old id 
(201305082239-1864771594-5050-8729-594)!!! So there is a mismatch between the 
slave id known to the master and the slave!

Currently the slave silently ignores a (re-)registered message from the 
master if it is already (re-)registered. This was originally designed to ignore 
duplicate (re-)registered messages sent by the master, but it does not catch 
the edge case above, where a later message carries a different slave id.

  was:
We have seen this in production at Twitter

Timeline of events:
06/26 06:46: Slave host rebooted
06/26 06:46.54: Slave registered (201305082239-1864771594-5050-8729-594) with 
master. But slave never got the ACK. Presumably there were network partition 
issues.
06/26 06:47.21: Slave disconnected from the master and the master removed the 
slave.
06/26 06:47.21: Immediately after, the master registered the slave with a new 
id (201305082239-1864771594-5050-8729-609).

But the slave only received the old id 
(201305082239-1864771594-5050-8729-609)!!! So there is a mismatch between the 
slave id known to the master and the slave!

Currently the slave silently ignores a (re-)registration message from the 
master if it is already (re-)registered. This was originally designed to ignore 
duplicate (re-)registered messages sent by the master. But clearly it doesn't 
catch the above edge case.

    
> A registered slave should check registration id when it receives multiple 
> (re-)registered messages from the master
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-530
>                 URL: https://issues.apache.org/jira/browse/MESOS-530
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Vinod Kone
>            Assignee: Vinod Kone
>             Fix For: 0.13.0
>
>
> We have seen this in production at Twitter
> Timeline of events:
> 06/26 06:46: Slave host rebooted
> 06/26 06:46.54: Slave registered (201305082239-1864771594-5050-8729-594) with 
> master. But slave never got the ACK. Presumably there were network partition 
> issues.
> 06/26 06:47.21: Slave disconnected from the master and the master removed the 
> slave.
> 06/26 06:47.21: Immediately after, the master registered the slave with a new 
> id (201305082239-1864771594-5050-8729-609).
> But the slave only received the old id 
> (201305082239-1864771594-5050-8729-594)!!! So there is a mismatch between the 
> slave id known to the master and the slave!
> Currently the slave silently ignores a (re-)registered message from the 
> master if it is already (re-)registered. This was originally designed to 
> ignore duplicate (re-)registered messages sent by the master, but it does 
> not catch the edge case above, where a later message carries a different 
> slave id.
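
The check asked for in the issue title could be sketched roughly as follows. This is an illustrative model only, not the Mesos slave code: the `Slave` class, the `on_registered` method and the `Action` values are hypothetical names, and how to react to a mismatch (for example, log and shut down) is an assumption left to the caller, since the issue does not specify it.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    ACCEPTED = auto()  # first registered message: adopt the id
    IGNORED = auto()   # duplicate carrying the same id: safe to drop
    MISMATCH = auto()  # different id: master and slave disagree


@dataclass
class Slave:
    # Hypothetical stand-in for the slave's state; None until registered.
    slave_id: Optional[str] = None

    def on_registered(self, new_id: str) -> Action:
        """Handle a (re-)registered message from the master."""
        if self.slave_id is None:
            self.slave_id = new_id
            return Action.ACCEPTED
        if self.slave_id == new_id:
            # The original intent: silently ignore true duplicates.
            return Action.IGNORED
        # The edge case from the timeline: do not ignore silently.
        return Action.MISMATCH


slave = Slave()
old_id = "201305082239-1864771594-5050-8729-594"
new_id = "201305082239-1864771594-5050-8729-609"
first = slave.on_registered(old_id)   # delayed ACK with the old id
second = slave.on_registered(new_id)  # later message with the new id
```

With the ids from the timeline, the first message is accepted and the second is reported as a mismatch instead of being dropped, which is the point at which the slave could surface the inconsistency.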

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
