[ 
https://issues.apache.org/jira/browse/MESOS-4546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124279#comment-15124279
 ] 

Neil Conway commented on MESOS-4546:
------------------------------------

Okay, my understanding of the situation here:

* {{zookeeper_init}} is given a list of hostnames. It resolves them to IP 
addresses, then returns a non-{{NULL}} handle (success).
* Internally, the Zk session loops over the IPs, trying to connect to them but 
not trying to re-resolve the hostnames.
* {{ZooKeeperProcess}} sets a session timeout (10 seconds by default), but this 
doesn't apply because the session hasn't yet been established.
* {{ZooKeeperProcess}} is also prepared to retry {{zookeeper_init}} for up to 
10 minutes, but this doesn't apply because {{zookeeper_init}} returns success.
* {{GroupProcess}} has a "reconnect timer" that is used to expire and forcibly 
retry _reconnection_ attempts that don't succeed within the session timeout (10 
seconds). However, this doesn't apply because we only start the reconnect timer 
when _reconnecting_, not when trying to establish the initial connection to 
ZooKeeper for a given {{GroupProcess}}.
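
To make the first two points concrete, here is a minimal sketch of a session that resolves its hostnames once and then only cycles over the cached addresses. This is not the ZooKeeper C client's code: {{SessionSketch}}, {{Resolver}} and {{nextAddress}} are hypothetical names used for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for DNS: maps a hostname to an IP address.
using Resolver = std::function<std::string(const std::string&)>;

// Illustration only: resolves every hostname once, at construction time.
class SessionSketch {
public:
  SessionSketch(const std::vector<std::string>& hosts, const Resolver& resolve) {
    assert(!hosts.empty());
    for (const std::string& host : hosts) {
      ips_.push_back(resolve(host));  // The only place the resolver is used.
    }
  }

  // Each connection attempt picks the next cached IP in round-robin order.
  // The resolver is never consulted again, so a DNS change goes unnoticed.
  const std::string& nextAddress() {
    const std::string& ip = ips_[next_ % ips_.size()];
    ++next_;
    return ip;
  }

private:
  std::vector<std::string> ips_;
  std::size_t next_ = 0;
};
```

If the address behind a hostname changes after construction, every later {{nextAddress()}} call still returns the old IP, which matches the "spinning" behaviour described in the issue.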

Proposed fix is to adjust {{GroupProcess}} so that we start the reconnect timer 
immediately, as soon as we make the first connection attempt to Zk.
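
A rough sketch of that idea, under stated assumptions: this is not the actual {{GroupProcess}} code, and {{GroupSketch}}, {{start}}, {{tick}} and the {{init}} callback are hypothetical names. It also assumes that a retry creates a fresh client, and so goes through {{zookeeper_init}} and hostname resolution again.

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <utility>

using Clock = std::chrono::steady_clock;

// Hypothetical connection states; not the real Mesos enum.
enum class State { DISCONNECTED, CONNECTING, CONNECTED };

class GroupSketch {
public:
  GroupSketch(std::chrono::seconds timeout, std::function<void()> init)
    : timeout_(timeout), init_(std::move(init)) {}

  // First connection attempt: the timer is armed here as well,
  // not only when reconnecting.
  void start(Clock::time_point now) { attempt(now); }

  // Called once the session is established; disarms the timer.
  void connected() {
    state_ = State::CONNECTED;
    deadline_.reset();
  }

  // Called periodically. If the deadline passed while still connecting,
  // forcibly retry. Returns true when a retry was triggered.
  bool tick(Clock::time_point now) {
    if (state_ == State::CONNECTING && deadline_ && now >= *deadline_) {
      attempt(now);
      return true;
    }
    return false;
  }

private:
  void attempt(Clock::time_point now) {
    state_ = State::CONNECTING;
    init_();  // Assumed to create a new client (and re-resolve hostnames).
    deadline_ = now + timeout_;
  }

  std::chrono::seconds timeout_;
  std::function<void()> init_;
  State state_ = State::DISCONNECTED;
  std::optional<Clock::time_point> deadline_;
};
```

With a 10 second timeout, a {{tick}} at 5 seconds does nothing, while a {{tick}} at 10 seconds or later triggers a second call to {{init}} unless {{connected()}} was called in between.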

> Mesos Agents needs to re-resolve hosts in zk string on leader change / 
> failure to connect
> -----------------------------------------------------------------------------------------
>
>                 Key: MESOS-4546
>                 URL: https://issues.apache.org/jira/browse/MESOS-4546
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>            Reporter: Cody Maloney
>            Assignee: Neil Conway
>            Priority: Blocker
>              Labels: mesosphere
>
> Sample Mesos Agent log: 
> https://gist.github.com/brndnmtthws/fb846fa988487250a809
> Note, zookeeper has a function to change the list of servers at runtime: 
> https://github.com/apache/zookeeper/blob/735ea78909e67c648a4978c8d31d63964986af73/src/c/src/zookeeper.c#L1207-L1232
> This comes up when using an AWS AutoScalingGroup for managing the set of 
> masters. 
> When the agent comes up the first time, it resolves the zk:// string. Once 
> all the hosts that were in the original string have failed (each fails and is 
> replaced by a new machine with the same DNS name), the agent just keeps 
> spinning in an internal loop, never re-resolving the DNS names.
> Two solutions I see are 
> 1. Update the list of servers / re-resolve
> 2. Have the agent detect it hasn't connected recently, and kill itself (Which 
> will force a re-resolution when the agent starts back up)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
