[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519026#comment-15519026
 ] 

Flavio Junqueira commented on ZOOKEEPER-2080:
---------------------------------------------

I suspect that {{connectOne}} needs to synchronize on self because it needs a 
consistent view of {{self.getView()}} and {{self.getLastSeenQuorumVerifier()}}. 
In fact, one interesting thing is that {{self.getView()}} is declared as:

{noformat}
public Map<Long,QuorumPeer.QuorumServer> getView() {
    return Collections.unmodifiableMap(getQuorumVerifier().getAllMembers());
}
{noformat}

So all we really need is {{self.getLastSeenQuorumVerifier()}}.

The root cause of all these deadlocks seems to be that we are trying to get a 
consistent view of the ensemble and locking {{QuorumPeer}} to guarantee 
consistency. The complex interdependencies across classes are making it 
difficult to guarantee that we don't have deadlocks. My suggestion is that we 
take a different approach. Each class that needs a consistent view of 
{{self.getLastSeenQuorumVerifier()}} will implement a listener that caches the 
new value locally, and {{QuorumPeer}} will broadcast changes to the quorum 
verifier to all listeners. Broadcasting can be done under a lock to prevent 
races with other operations inside {{QuorumPeer}}. I think that if we do 
something like this, we will be avoiding the circular dependencies and fixing 
the deadlocks. The change doesn't seem to be very complex, but I could be 
wrong.
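
To make the suggestion more concrete, here is a minimal sketch of the listener 
idea. It is only an illustration under stated assumptions, not a patch: 
{{QuorumVerifierListener}}, {{PeerSketch}}, and {{CnxManagerSketch}} are 
hypothetical names that do not exist in the ZooKeeper code base, and the type 
parameter {{V}} stands in for {{QuorumVerifier}}.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface; not part of the ZooKeeper code base.
interface QuorumVerifierListener<V> {
    void onQuorumVerifierChanged(V newVerifier);
}

// Stand-in for QuorumPeer. V stands in for QuorumVerifier.
class PeerSketch<V> {
    private final List<QuorumVerifierListener<V>> listeners = new ArrayList<>();
    private V lastSeenQuorumVerifier;

    PeerSketch(V initial) {
        this.lastSeenQuorumVerifier = initial;
    }

    // Registers the listener and hands it the current value under the same
    // lock, so it cannot miss an update that races with registration.
    synchronized void addListener(QuorumVerifierListener<V> listener) {
        listeners.add(listener);
        listener.onQuorumVerifierChanged(lastSeenQuorumVerifier);
    }

    // Updates the value and broadcasts it while holding the peer's lock, so
    // every listener observes changes in the order they were applied.
    synchronized void setLastSeenQuorumVerifier(V newVerifier) {
        lastSeenQuorumVerifier = newVerifier;
        for (QuorumVerifierListener<V> listener : listeners) {
            listener.onQuorumVerifierChanged(newVerifier);
        }
    }
}

// Stand-in for a class that currently locks the peer to read the verifier.
class CnxManagerSketch<V> implements QuorumVerifierListener<V> {
    private volatile V cachedVerifier;

    @Override
    public void onQuorumVerifierChanged(V newVerifier) {
        // Only stores the reference. It must not call back into the peer or
        // take other locks, otherwise the circular dependency returns.
        cachedVerifier = newVerifier;
    }

    // Reads the cached value without touching the peer's lock.
    V cachedVerifier() {
        return cachedVerifier;
    }
}

class ListenerSketchDemo {
    public static void main(String[] args) {
        PeerSketch<String> peer = new PeerSketch<>("config-v1");
        CnxManagerSketch<String> cnx = new CnxManagerSketch<>();
        peer.addListener(cnx);
        peer.setLastSeenQuorumVerifier("config-v2");
        System.out.println(cnx.cachedVerifier());
    }
}
```

The point of the sketch is that a method like {{connectOne}} would read its 
local cached copy instead of synchronizing on {{self}}, so the only lock taken 
during a broadcast is the peer's own, and the listener callback does nothing 
that could block on another lock.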

> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
>                 Key: ZOOKEEPER-2080
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Ted Yu
>            Assignee: Michael Han
>             Fix For: 3.5.3, 3.6.0
>
>         Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
