[ https://issues.apache.org/jira/browse/MESOS-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522127#comment-14522127 ]
Raul Gutierrez Segales commented on MESOS-2681:
-----------------------------------------------
yeah, they only know about the old servers and won't look up the new ones until
a new handle is created. but this won't happen until you get an expiration,
which can only happen if you manage to reach one of the old servers (which are
all gone by now).
this is quite a corner case: mixing DNS RR with swapping out the whole ensemble
(with session timeouts so high that sessions don't expire in between
the swaps), so i am skeptical about adding a specific handler for this in the
mesos codebase. any thoughts on how to do this without a convoluted approach
that ends up reinventing the zk session state machine? maybe documenting this
behavior for mesos operators would be enough?
cc: [~yasumoto]
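
To make the failure mode in the comment above concrete, here is a minimal toy model in Python. It is only a sketch: it does not use the ZooKeeper client, and the names `ToyClient`, `step`, `zk.example`, and the server names are all hypothetical. It assumes that hostnames are resolved only when a handle is created, and that a client learns about session expiration only from a server it can reach.

```python
class ToyClient:
    """Toy model: the server list is resolved only when a handle is created."""

    def __init__(self, resolve):
        self._resolve = resolve      # hypothetical stand-in for a DNS lookup
        self.resolutions = 0
        self._new_handle()

    def _new_handle(self):
        # Models zookeeper_init(): the only point where hostnames are resolved.
        self.servers = list(self._resolve())
        self.resolutions += 1

    def step(self, live_servers, session_alive):
        """One reconnect attempt against the currently live servers."""
        if not any(s in live_servers for s in self.servers):
            # Nobody we know about is reachable, so nobody can tell us the
            # session expired; we keep retrying the cached list.
            return "retrying"
        if not session_alive:
            # A reachable server reports expiration: create a new handle,
            # which re-resolves the ensemble.
            self._new_handle()
            return "expired"
        return "connected"


dns = {"zk.example": ["old1", "old2", "old3"]}   # hypothetical round-robin record
client = ToyClient(lambda: dns["zk.example"])
print(client.step({"old1", "old2", "old3"}, session_alive=True))

# Swap the whole ensemble before the old session can expire.
dns["zk.example"] = ["new1", "new2", "new3"]
for _ in range(3):
    print(client.step({"new1", "new2", "new3"}, session_alive=False))
print(client.servers, client.resolutions)
```

In this model every attempt after the full swap reports "retrying" and the cached server list never changes. If even one old server stays up long enough to report the expiration, the client creates a new handle and picks up the new members.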
> Slave process must restart to update ensemble members
> -----------------------------------------------------
>
> Key: MESOS-2681
> URL: https://issues.apache.org/jira/browse/MESOS-2681
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Reporter: Joe Smith
>
> Right now, if a ZooKeeper ensemble has (for instance) more observers added to
> it, the Mesos Slaves will not see them, and continue to attempt to connect to
> only the original members. A restart of the slave process is required to call
> {{getaddrinfo}} again and enumerate the list of hosts in the ensemble.
> Subsequent {{getaddrinfo}} calls _will only_ occur when {{zookeeper_init()}}
> is called again, that is to say: when the old session expires and you need to
> create a new one. If you swap all hosts in your ensemble too fast, without
> permitting time for old sessions to expire, you'd end up with clients looping
> forever, trying to connect to the old servers in order to get their old
> sessions expired.
> This is best tracked by ZOOKEEPER-1998, where there is some discussion about
> a necessary improvement to the implementation already in the 3.5.x branch, or
> putting this functionality (debatably a feature vs. fixing a bug) in 3.4.x.
> (Thanks to [~rgs] for reviewing this as well)