[
https://issues.apache.org/jira/browse/MESOS-4546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124279#comment-15124279
]
Neil Conway commented on MESOS-4546:
------------------------------------
Okay, my understanding of the situation here:
* {{zookeeper_init}} is given a list of hostnames. It resolves them to IP
addresses, then returns a non-{{NULL}} handle (success).
* Internally, the Zk session loops over the IPs, trying to connect to them but
not trying to re-resolve the hostnames.
* {{ZooKeeperProcess}} sets a session timeout (10 seconds by default), but this
doesn't apply because the session hasn't yet been established.
* {{ZooKeeperProcess}} is also prepared to retry {{zookeeper_init}} for up to
10 minutes, but this doesn't apply because {{zookeeper_init}} returns success.
* {{GroupProcess}} has a "reconnect timer" that is used to expire and forcibly
retry _reconnection_ attempts that don't succeed within the session timeout (10
seconds). However, this doesn't apply because we only start the reconnect timer
when _reconnecting_, not when trying to establish the initial connection to
ZooKeeper for a given {{GroupProcess}}.
Proposed fix is to adjust {{GroupProcess}} so that we start the reconnect timer
immediately, as soon as we make the first connection attempt to Zk.
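
A minimal sketch of the proposed change, written as a standalone model rather than the actual {{GroupProcess}} code. The class name {{GroupSketch}}, its methods, and the explicit {{now}} parameter are all hypothetical and chosen only for illustration. The point it tries to show is that the timer is armed in the same place for the initial attempt and for retries, so an initial attempt that hangs is expired after the session timeout.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical model of the connection state; NOT the real Mesos classes.
enum class State { DISCONNECTED, CONNECTING, CONNECTED };

class GroupSketch {
public:
  explicit GroupSketch(std::chrono::seconds sessionTimeout)
    : timeout_(sessionTimeout) {
    assert(sessionTimeout.count() > 0);
  }

  // Used for the initial attempt and for every retry. Arming the timer
  // here (instead of only on reconnection) is the proposed fix.
  // `now` is elapsed time since an arbitrary start, passed in explicitly
  // so the sketch does not depend on a real clock.
  void startConnecting(std::chrono::seconds now) {
    state_ = State::CONNECTING;
    ++attempts_;
    deadline_ = now + timeout_;
    armed_ = true;
  }

  // Called when the session is established; cancels the timer.
  void connected() {
    state_ = State::CONNECTED;
    armed_ = false;
  }

  // Called periodically. Returns true if the pending attempt expired and
  // was restarted. In real code the restart would discard the old handle
  // and call zookeeper_init again, which resolves the hostnames afresh.
  bool tick(std::chrono::seconds now) {
    if (state_ == State::CONNECTING && armed_ && now >= deadline_) {
      startConnecting(now);
      return true;
    }
    return false;
  }

  State state() const { return state_; }
  int attempts() const { return attempts_; }

private:
  std::chrono::seconds timeout_;
  std::chrono::seconds deadline_{0};
  bool armed_ = false;
  State state_ = State::DISCONNECTED;
  int attempts_ = 0;
};
```

With a 10 second timeout, a first attempt started at time 0 is left alone when {{tick}} is called at 9 seconds and is restarted when it is called at 10 seconds. Once {{connected}} has been called, later ticks do nothing.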
> Mesos Agents needs to re-resolve hosts in zk string on leader change /
> failure to connect
> -----------------------------------------------------------------------------------------
>
> Key: MESOS-4546
> URL: https://issues.apache.org/jira/browse/MESOS-4546
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Reporter: Cody Maloney
> Assignee: Neil Conway
> Priority: Blocker
> Labels: mesosphere
>
> Sample Mesos Agent log:
> https://gist.github.com/brndnmtthws/fb846fa988487250a809
> Note, zookeeper has a function to change the list of servers at runtime:
> https://github.com/apache/zookeeper/blob/735ea78909e67c648a4978c8d31d63964986af73/src/c/src/zookeeper.c#L1207-L1232
> This comes up when using an AWS AutoScalingGroup for managing the set of
> masters.
> When the agent first comes up, it resolves the hostnames in the zk:// string.
> Once all the hosts in the original string have failed (each one fails and is
> replaced by a new machine with the same DNS name), the agent just keeps
> spinning in an internal loop, never re-resolving the DNS names.
> Two solutions I see are:
> 1. Update the list of servers / re-resolve
> 2. Have the agent detect it hasn't connected recently, and kill itself (which
> will force a re-resolution when the agent starts back up)