Cody Maloney created MESOS-4546:
-----------------------------------

             Summary: Mesos Agents need to re-resolve hosts in zk string on 
leader change / failure to connect
                 Key: MESOS-4546
                 URL: https://issues.apache.org/jira/browse/MESOS-4546
             Project: Mesos
          Issue Type: Bug
          Components: slave
            Reporter: Cody Maloney
            Assignee: Artem Harutyunyan
            Priority: Blocker


Sample Mesos Agent log: https://gist.github.com/brndnmtthws/fb846fa988487250a809

Note that the ZooKeeper C client has a function to change the list of servers at runtime: 
https://github.com/apache/zookeeper/blob/735ea78909e67c648a4978c8d31d63964986af73/src/c/src/zookeeper.c#L1207-L1232

This comes up when using an AWS AutoScalingGroup for managing the set of 
masters. 

When the agent first comes up, it resolves the hosts in the zk:// string. Once all 
the hosts from the original string have failed (each one fails and is replaced by a 
new machine with the same DNS name), the agent just keeps spinning in an 
internal loop and never re-resolves the DNS names.
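
To make the stale-address condition concrete, here is a rough Python sketch (not
Mesos code) that splits the hosts out of a zk:// string, resolves them again, and
reports which cached addresses no longer appear in DNS. The function names
(parse_zk_hosts, resolve_all, stale_addresses) and the injectable resolver
argument are made up for illustration; the default port 2181 is ZooKeeper's
standard client port.

```python
import socket


def parse_zk_hosts(zk_url):
    """Split 'zk://h1:2181,h2:2181/path' into [('h1', 2181), ('h2', 2181)].

    Hypothetical helper. Bracketed IPv6 literals are not handled.
    """
    if not zk_url.startswith("zk://"):
        raise ValueError("expected a zk:// URL")
    rest = zk_url[len("zk://"):]
    hosts_part = rest.split("/", 1)[0]
    # Drop optional 'user:password@' credentials in front of the host list.
    hosts_part = hosts_part.rsplit("@", 1)[-1]
    pairs = []
    for entry in hosts_part.split(","):
        host, _, port = entry.partition(":")
        pairs.append((host, int(port) if port else 2181))
    return pairs


def resolve_all(pairs, resolver=socket.getaddrinfo):
    """Resolve every (host, port) pair to a set of (ip, port) tuples."""
    addrs = set()
    for host, port in pairs:
        try:
            infos = resolver(host, port, proto=socket.IPPROTO_TCP)
        except socket.gaierror:
            continue  # Host does not resolve right now; skip it.
        for info in infos:
            addrs.add((info[4][0], port))
    return addrs


def stale_addresses(cached, zk_url, resolver=socket.getaddrinfo):
    """Return cached (ip, port) entries that DNS no longer returns."""
    current = resolve_all(parse_zk_hosts(zk_url), resolver)
    return cached - current
```

If stale_addresses() returns every address the agent resolved at startup, the
agent is in exactly the situation described above: none of the machines it keeps
retrying still exist.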

Two solutions I see are:
1. Update the list of servers / re-resolve the host names.
2. Have the agent detect that it hasn't connected recently and kill itself (which 
will force a re-resolution when the agent starts back up).
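
Option 2 could look roughly like the watchdog below. This is only a Python
sketch of the idea, not how the agent is structured; ConnectionWatchdog, its
methods, and the timeout value are all hypothetical, and it assumes some
supervisor restarts the agent after it exits.

```python
import time


class ConnectionWatchdog:
    """Tracks how long it has been since the last successful connection."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self.clock = clock  # Injectable so the logic can be tested without sleeping.
        self.last_connected = clock()

    def mark_connected(self):
        """Call whenever a connection to ZooKeeper succeeds."""
        self.last_connected = self.clock()

    def expired(self):
        """True once no connection has succeeded for longer than the timeout."""
        return self.clock() - self.last_connected > self.timeout_secs
```

The reconnect loop would call mark_connected() on every successful connection
and exit with a non-zero status once expired() returns True, so that the
restarted process resolves the zk:// string from scratch.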



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
