[ 
https://issues.apache.org/jira/browse/MESOS-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-714:
-----------------------------

      Description: 
The following sequence of events happened in production at Twitter.

--> Slave registered with master A.
--> A sent an ACK for the registration but died immediately (user restart).
--> Slave detected a new master B and sent a re-register request.
--> Slave then received the delayed ACK from A.
--> The bug here is that the slave accepted this ACK even though it was not 
from master B.
--> Master B ignored the re-register request because it didn't yet know it was 
the master!
--> Slave never retried its registration because it thought it was registered 
with B.

At this point the slave thinks it is registered, but the master (B) has no 
record of it!

Fix: Slaves should check that (re-)registered messages are from the expected 
master pid and, if not, ignore them.

  was:
The following sequence of events happened in production at Twitter.

--> Slave registered with master A.
--> A sent an ACK for the registration but died immediately (user restart).
--> Slave detected a new master B and sent a re-register request.
--> Slave then received the delayed ACK from A.
--> The bug here is that the slave accepted this ACK even though it was not 
from master B.
--> Master B ignored the re-register request because it didn't yet know it was 
the master!
--> Slave never retried its registration because it thought it was registered 
with B.

At this point the slave thinks it is registered, but the master (B) has no 
record of it!

Fix: Slaves should check that (re-)registered messages are from the expected 
master pid.

    Fix Version/s:     (was: 0.15.0)
                   0.14.1
         Assignee: Benjamin Mahler  (was: Vinod Kone)

> Slave should check if the (re-)registered is from the expected master
> ---------------------------------------------------------------------
>
>                 Key: MESOS-714
>                 URL: https://issues.apache.org/jira/browse/MESOS-714
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Vinod Kone
>            Assignee: Benjamin Mahler
>             Fix For: 0.14.1
>
>
> The following sequence of events happened in production at Twitter.
> --> Slave registered with master A.
> --> A sent an ACK for the registration but died immediately (user restart).
> --> Slave detected a new master B and sent a re-register request.
> --> Slave then received the delayed ACK from A.
> --> The bug here is that the slave accepted this ACK even though it was not 
> from master B.
> --> Master B ignored the re-register request because it didn't yet know it 
> was the master!
> --> Slave never retried its registration because it thought it was 
> registered with B.
> At this point the slave thinks it is registered, but the master (B) has no 
> record of it!
> Fix: Slaves should check that (re-)registered messages are from the expected 
> master pid and, if not, ignore them.
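
The proposed fix can be illustrated with a small toy model. This is only a
sketch under stated assumptions, not the actual Mesos slave code: the Slave
class, the detected/on_registered method names, and the pid strings
"master@hostA:5050" and "master@hostB:5050" are all hypothetical and chosen
for illustration. It replays the sequence from the description and shows that
a sender check leaves the slave unregistered after the stale ACK from A.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical pids, for illustration only.
MASTER_A = "master@hostA:5050"
MASTER_B = "master@hostB:5050"


@dataclass
class Slave:
    """Toy model of a slave; not the real Mesos implementation."""

    master: Optional[str] = None  # pid of the master the slave currently expects
    registered: bool = False

    def detected(self, master_pid: str) -> None:
        # A new master was detected: any earlier registration is void.
        self.master = master_pid
        self.registered = False

    def on_registered(self, sender_pid: str) -> bool:
        # Proposed guard: ignore (re-)registered messages that do not
        # come from the expected master pid.
        if sender_pid != self.master:
            return False
        self.registered = True
        return True


def replay() -> Slave:
    """Replay the event sequence from the issue description."""
    slave = Slave()
    slave.detected(MASTER_A)       # slave registers with A; A's ACK is in flight
    slave.detected(MASTER_B)       # A died, B detected, re-register request sent
    slave.on_registered(MASTER_A)  # delayed ACK from A arrives and is ignored
    return slave
```

With the guard in place, the slave returned by replay() is still unregistered,
so a retry of the re-registration with B remains possible; a later ACK whose
sender is B is accepted as usual.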



--
This message was sent by Atlassian JIRA
(v6.1#6144)
