[
https://issues.apache.org/jira/browse/MESOS-4546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Artem Harutyunyan updated MESOS-4546:
-------------------------------------
Assignee: Neil Conway (was: Artem Harutyunyan)
> Mesos Agents need to re-resolve hosts in zk string on leader change /
> failure to connect
> -----------------------------------------------------------------------------------------
>
> Key: MESOS-4546
> URL: https://issues.apache.org/jira/browse/MESOS-4546
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Reporter: Cody Maloney
> Assignee: Neil Conway
> Priority: Blocker
> Labels: mesosphere
>
> Sample Mesos Agent log:
> https://gist.github.com/brndnmtthws/fb846fa988487250a809
> Note: ZooKeeper has a function to change the list of servers at runtime:
> https://github.com/apache/zookeeper/blob/735ea78909e67c648a4978c8d31d63964986af73/src/c/src/zookeeper.c#L1207-L1232
> This comes up when using an AWS AutoScalingGroup for managing the set of
> masters.
> When the agent first comes up, it resolves the hosts in the zk:// string. Once
> every host in the original string has failed (each one fails and is replaced
> by a new machine with the same DNS name), the agent keeps spinning in an
> internal loop and never re-resolves the DNS names.
> Two solutions I see are:
> 1. Update the list of servers / re-resolve.
> 2. Have the agent detect that it hasn't connected recently and kill itself (which
> will force a re-resolution when the agent starts back up).
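
To illustrate the first option, here is a minimal sketch in Python of re-resolving the host names after a run of consecutive connection failures. It is not Mesos or ZooKeeper code: the class `ReResolvingHostList`, the `max_failures` threshold, and the injectable `resolver` parameter are all hypothetical names chosen for the illustration. A real fix would presumably hand the fresh addresses to the ZooKeeper client rather than just storing them.

```python
import socket
from typing import Callable, Dict, Iterable, List

# A resolver maps a host name to a list of IP address strings.
Resolver = Callable[[str], List[str]]


def dns_resolve(host: str) -> List[str]:
    """Resolve a host name with the system resolver (sorted, de-duplicated)."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


class ReResolvingHostList:
    """Hypothetical helper: re-resolve host names after repeated failures."""

    def __init__(self, hosts: Iterable[str],
                 resolver: Resolver = dns_resolve,
                 max_failures: int = 3) -> None:
        self.hosts = list(hosts)
        self.resolver = resolver
        self.max_failures = max_failures  # assumed threshold, not from Mesos
        self.failures = 0
        self.addresses: Dict[str, List[str]] = self._resolve_all()

    def _resolve_all(self) -> Dict[str, List[str]]:
        resolved: Dict[str, List[str]] = {}
        for host in self.hosts:
            try:
                resolved[host] = self.resolver(host)
            except OSError:
                # socket.gaierror is an OSError; keep the host with no addresses.
                resolved[host] = []
        return resolved

    def record_success(self) -> None:
        """Reset the failure counter after a successful connection."""
        self.failures = 0

    def record_failure(self) -> bool:
        """Count a failed connection; re-resolve once the threshold is hit.

        Returns True if the addresses were re-resolved on this call.
        """
        self.failures += 1
        if self.failures < self.max_failures:
            return False
        self.failures = 0
        self.addresses = self._resolve_all()
        return True
```

The caller would invoke `record_failure()` from its reconnect loop and, when it returns `True`, reconnect using `addresses`. The second option (exit and let a supervisor restart the agent) could reuse the same counter, replacing the re-resolution step with a process exit.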