Vinod Kone created MESOS-304:
--------------------------------

             Summary: Master should register a slave only after it confirms it 
can talk to the slave
                 Key: MESOS-304
                 URL: https://issues.apache.org/jira/browse/MESOS-304
             Project: Mesos
          Issue Type: Improvement
            Reporter: Vinod Kone


We have seen this issue from users running on EC2 and also at Twitter.

The crux of the issue is that the master starts offering a slave's resources 
as soon as it gets a Register message. If for some reason the master --> 
slave connection is not viable (e.g. the slave advertised its private IP 
address, or DNS failures), we end up in a loop as follows:

--> Slave sends Register message to master
--> Master accepts it and offers resources to the framework
--> The master's health checks to the slave keep failing
--> Framework launches tasks on this slave, which get dropped on the floor
--> After health check timeout (>60s), master disconnects the slave
--> Slave sends a Register message again.
--> Repeat

One way to solve this problem is to do a 3-way handshake for registration.
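
A rough sketch of what such a handshake could look like on the master side 
follows. It is an illustration only, not Mesos code: the Master class, the 
"REGISTER_PROBE" message name, and the send() transport callback are all 
hypothetical. The idea is that a slave becomes offerable only after it has 
acknowledged a message that the master sent to the address the slave 
advertised.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class Master:
    # Hypothetical transport: send(address, message) returns True if the
    # message could be delivered to that address.
    send: Callable[[str, str], bool]
    pending: Set[str] = field(default_factory=set)
    registered: Set[str] = field(default_factory=set)

    def on_register(self, slave_addr: str) -> None:
        """Step 1 (slave -> master): Register received; do not offer yet."""
        self.pending.add(slave_addr)
        # Step 2 (master -> slave): probe the address the slave advertised.
        if not self.send(slave_addr, "REGISTER_PROBE"):
            # Assumption: delivery failure is visible immediately. A real
            # system would more likely expire the pending entry on a timeout.
            self.pending.discard(slave_addr)

    def on_probe_ack(self, slave_addr: str) -> None:
        """Step 3 (slave -> master): ack proves the probe reached the slave."""
        if slave_addr in self.pending:
            self.pending.remove(slave_addr)
            self.registered.add(slave_addr)

    def offerable_slaves(self) -> Set[str]:
        """Only fully registered slaves have their resources offered."""
        return set(self.registered)
```

With this shape, a slave that registers with an unreachable address never 
leaves the pending state, so its resources are never offered and no tasks 
are launched on it.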

This should also be done for framework registration.

