[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519026#comment-15519026 ]
Flavio Junqueira commented on ZOOKEEPER-2080:
---------------------------------------------

I suspect that {{connectOne}} needs to synchronize on self because it needs a consistent view of {{self.getView()}} and {{self.getLastSeenQuorumVerifier()}}. In fact, one interesting thing is that {{self.getView()}} is declared as:

{noformat}
public Map<Long,QuorumPeer.QuorumServer> getView() {
    return Collections.unmodifiableMap(getQuorumVerifier().getAllMembers());
}
{noformat}

So all we really need is {{self.getLastSeenQuorumVerifier()}}.

The root cause of all these deadlocks seems to be that we are trying to get a consistent view of the ensemble and locking {{QuorumPeer}} to guarantee consistency. The complex interdependencies across classes are making it difficult to guarantee that we don't have deadlocks.

My suggestion is that we take a different approach. Each class that needs a consistent view of {{self.getLastSeenQuorumVerifier()}} will implement a listener that caches the new value locally, and {{QuorumPeer}} will broadcast changes to the quorum verifier to all listeners. Broadcasting can be done under a lock to prevent races with other operations inside {{QuorumPeer}}.

I think that if we do something like this, we will avoid the circular dependencies and fix the deadlocks. The change doesn't seem to be super complex, though I could be wrong.
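
To make the listener suggestion above concrete, here is a rough, self-contained sketch. It is not taken from any attached patch: {{VerifierSnapshot}}, {{VerifierListener}}, {{PeerSketch}} and {{ConnectionManagerSketch}} are hypothetical stand-ins for {{QuorumVerifier}}, a new listener interface, {{QuorumPeer}} and a class such as the one that owns {{connectOne}}, and only a version number is modeled.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for QuorumVerifier; only a version number is modeled.
final class VerifierSnapshot {
    final long version;
    VerifierSnapshot(long version) { this.version = version; }
}

// Hypothetical listener interface; not part of the existing ZooKeeper API.
interface VerifierListener {
    void onVerifierChanged(VerifierSnapshot newVerifier);
}

// Hypothetical stand-in for QuorumPeer.
class PeerSketch {
    private final List<VerifierListener> listeners = new CopyOnWriteArrayList<>();
    private VerifierSnapshot lastSeen = new VerifierSnapshot(0);

    // Register a listener and seed its cache with the current value.
    synchronized void addListener(VerifierListener listener) {
        listeners.add(listener);
        listener.onVerifierChanged(lastSeen);
    }

    // Update the verifier and broadcast it while holding the peer's lock,
    // so the broadcast cannot race with other updates inside the peer.
    synchronized void setLastSeenVerifier(VerifierSnapshot verifier) {
        lastSeen = verifier;
        for (VerifierListener listener : listeners) {
            listener.onVerifierChanged(verifier);
        }
    }
}

// Hypothetical stand-in for a class that today locks the peer in connectOne.
class ConnectionManagerSketch implements VerifierListener {
    // Written only by the callback; volatile makes the new value visible
    // to other threads without taking any lock.
    private volatile VerifierSnapshot cached;

    @Override
    public void onVerifierChanged(VerifierSnapshot newVerifier) {
        cached = newVerifier; // no lock acquired here, so no lock-order cycle
    }

    // Reads the cached snapshot once instead of synchronizing on the peer.
    long connectOne() {
        VerifierSnapshot view = cached;
        return view.version;
    }
}
```

In this sketch the callback only stores a reference and never takes a lock of its own, so the peer can safely invoke it while holding its monitor. Whether the real listeners can stay that simple is an assumption that would need checking against the actual classes.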

> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
>                 Key: ZOOKEEPER-2080
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Ted Yu
>            Assignee: Michael Han
>             Fix For: 3.5.3, 3.6.0
>
>         Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)