[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401425#comment-15401425 ]
Raul Gutierrez Segales commented on ZOOKEEPER-2080: --------------------------------------------------- [~hanm]: thanks for tracking this down and for the patch! A few questions/asks, looking at the code: {code} Election election = null; synchronized(self) { try { rqv = self.configFromString(new String(b)); QuorumVerifier curQV = self.getQuorumVerifier(); if (rqv.getVersion() > curQV.getVersion()) { LOG.info("{} Received version: {} my version: {}", self.getId(), Long.toHexString(rqv.getVersion()), Long.toHexString(self.getQuorumVerifier().getVersion())); if (self.getPeerState() == ServerState.LOOKING) { LOG.debug("Invoking processReconfig(), state: {}", self.getServerState()); self.processReconfig(rqv, null, null, false); if (!rqv.equals(curQV)) { LOG.info("restarting leader election"); // Signaling quorum peer to restart leader election. self.shuttingDownLE = true; // Get a hold of current leader election object of quorum peer, // so we can clean it up later without holding the lock of quorum // peer. If we shutdown current leader election we will run into // potential deadlock. See ZOOKEEPER-2080 for more details. election = self.getElectionAlg(); } } else { LOG.debug("Skip processReconfig(), state: {}", self.getServerState()); } } } catch (IOException e) { LOG.error("Something went wrong while processing config received from {}", response.sid); } catch (ConfigException e) { LOG.error("Something went wrong while processing config received from {}", response.sid); } } {code} Do we really need to synchronize around self for the first part: {code} rqv = self.configFromString(new String(b)); QuorumVerifier curQV = self.getQuorumVerifier(); if (rqv.getVersion() > curQV.getVersion()) { {code} ? Sounds like that can be done without synchronizing... no? Also, given you've spent a good amount of cycles untangling the dependencies around locking QuorumPeer, could you maybe add a comment before the synchronize(self) block noting why it is needed and who else might be contending for this lock. Thanks so much! I think unit testing these things is a bit tricky, we might get a better return by just keeping better comments around synchronized regions and generally keeping them well maintained (imho). So I am happy to +1 without tests. > ReconfigRecoveryTest fails intermittently > ----------------------------------------- > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)