GitHub user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/92#discussion_r97451281

    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java ---
    @@ -468,31 +469,33 @@ synchronized private boolean connectOne(long sid, InetSocketAddress electionAddr
          */
         synchronized void connectOne(long sid){
    +        connectOne(sid, self.getLastSeenQuorumVerifier());
    +    }
    +
    +    synchronized void connectOne(long sid, QuorumVerifier lastSeenQV){
             if (senderWorkerMap.get(sid) != null) {
    -             LOG.debug("There is a connection already for server " + sid);
    -             return;
    +            LOG.debug("There is a connection already for server " + sid);
    +            return;
             }
    -        synchronized(self) {
    -           boolean knownId = false;
    -            // Resolve hostname for the remote server before attempting to
    -            // connect in case the underlying ip address has changed.
    -            self.recreateSocketAddresses(sid);
    -            if (self.getView().containsKey(sid)) {
    -               knownId = true;
    -                if (connectOne(sid, self.getView().get(sid).electionAddr))
    -                   return;
    -            }
    -            if (self.getLastSeenQuorumVerifier()!=null && self.getLastSeenQuorumVerifier().getAllMembers().containsKey(sid)
    -                   && (!knownId || (self.getLastSeenQuorumVerifier().getAllMembers().get(sid).electionAddr !=
    -                   self.getView().get(sid).electionAddr))) {
    -               knownId = true;
    -                if (connectOne(sid, self.getLastSeenQuorumVerifier().getAllMembers().get(sid).electionAddr))
    -                   return;
    -            }
    -            if (!knownId) {
    -                LOG.warn("Invalid server id: " + sid);
    +        boolean knownId = false;
    +        // Resolve hostname for the remote server before attempting to
    +        // connect in case the underlying ip address has changed.
    +        self.recreateSocketAddresses(sid);
    +        if (self.getView().containsKey(sid)) {
    --- End diff --

@shralex Thanks for the review comments! I made two changes:

* Refactored the code to reuse the getView results. The view is not passed in as a parameter, because I thought that keeps the call sites simpler.
* The code block inside connectOne is now synchronized on the same lock that protects the other view / quorum verifier accesses of the same QuorumPeer. I think this makes the block semantically equivalent to the code before this change, which synchronized on the whole QuorumPeer 'self' with the intention that accesses to the configs are protected during the entire execution of connectOne. I did not add any comments because, with the explicit synchronized block, the semantics should be self-explanatory.

My stress tests look good so far with the latest changes.
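To make the two points above concrete, here is a minimal sketch of the pattern, not the actual patch. Everything in it is hypothetical: `Peer` stands in for QuorumPeer, `configLock` for whatever lock guards the view and quorum verifier, plain strings replace the election addresses, and `candidateAddrs` only collects the addresses that connectOne would try instead of opening connections. It also compares addresses with `equals`, which is an assumption of the sketch rather than what the diff does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ConnectOneSketch {

    /** Hypothetical stand-in for QuorumPeer; not the real class. */
    static class Peer {
        // Hypothetical lock guarding both the view and the last seen members.
        private final Object configLock = new Object();
        private Map<Long, String> view = new HashMap<>();
        private Map<Long, String> lastSeenMembers = null;

        Map<Long, String> getView() {
            synchronized (configLock) { return view; }
        }

        Map<Long, String> getLastSeenMembers() {
            synchronized (configLock) { return lastSeenMembers; }
        }

        void setConfig(Map<Long, String> newView, Map<Long, String> newLastSeen) {
            synchronized (configLock) {
                view = new HashMap<>(newView);
                lastSeenMembers = (newLastSeen == null) ? null : new HashMap<>(newLastSeen);
            }
        }
    }

    /**
     * Collects, in order, the addresses a connectOne-style method would try for sid.
     * Both config reads happen under one lock, so they see a consistent snapshot,
     * and each getter result is fetched once and reused.
     */
    static List<String> candidateAddrs(Peer self, long sid) {
        List<String> candidates = new ArrayList<>();
        synchronized (self.configLock) { // reentrant: the getters take the same lock
            Map<Long, String> view = self.getView();
            Map<Long, String> lastSeen = self.getLastSeenMembers();
            String viewAddr = view.get(sid);
            if (viewAddr != null) {
                candidates.add(viewAddr);
            }
            if (lastSeen != null && lastSeen.containsKey(sid)) {
                String lastSeenAddr = lastSeen.get(sid);
                // Only add the last seen address if it offers something new.
                if (viewAddr == null || !viewAddr.equals(lastSeenAddr)) {
                    candidates.add(lastSeenAddr);
                }
            }
        }
        return candidates; // an empty list corresponds to "Invalid server id"
    }

    public static void main(String[] args) {
        Peer peer = new Peer();
        Map<Long, String> view = new HashMap<>();
        view.put(1L, "host1:3888");
        Map<Long, String> lastSeen = new HashMap<>();
        lastSeen.put(1L, "host1-new:3888");
        lastSeen.put(2L, "host2:3888");
        peer.setConfig(view, lastSeen);
        for (long sid = 1; sid <= 3; sid++) {
            System.out.println(sid + " -> " + candidateAddrs(peer, sid));
        }
    }
}
```

Because the block holds the narrower config lock rather than the whole peer object, unrelated synchronized methods on the peer are not blocked while an address is being resolved, yet the view and the last seen members still cannot change between the two reads.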