Github user mkedwards commented on the issue:

    https://github.com/apache/zookeeper/pull/707
  
    The deadlock, as I understand it, comes from an attempt to take `QV_LOCK` 
in order to access `myElectionAddress` in thread A, while holding the lock on 
the `QuorumCnxManager` object; meanwhile, thread B, which holds the `QV_LOCK` 
while `connectOne()` is running, tries to take the lock on the 
`QuorumCnxManager` object, producing a deadly embrace.  From the Jira:
    
    ```
        [junit]  java.lang.Thread.State: BLOCKED
        [junit]         at  
org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
        [junit]         at  
org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
        [junit]         at  
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
        [junit]         at  
org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
        [junit]         at  
org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
        [junit] 
    
    
        [junit]  java.lang.Thread.State: BLOCKED
        [junit]         at  
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
        [junit]         at  
org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
        [junit]         at  
org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
        [junit]         at  
org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
        [junit]         at  
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
        [junit]         at  
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
    ```
    
    This patch removes the synchronization on `QV_LOCK` from the address 
accessor paths, where its current effect seems primarily to ensure cross-thread 
visibility; the memory barriers involved in setting/fetching through an 
`AtomicReference` are equally effective at this.  Wrapping a single 
`AtomicReference<AddressTuple>` around all three addresses ensures that you 
don't get interleaved sets when two threads each try to update the address 
fields -- although in the code as I read it, callers are holding the `QV_LOCK` 
anyway.  I can't yet say that I've analyzed all other paths that take 
`QV_LOCK`, but I'm fairly sure that if 
https://github.com/apache/zookeeper/pull/247 would work, this would work at 
least as well, while at least partially addressing concerns about getting 
"outdated" state in read paths.  (If this is a serious concern, then presumably 
we need a separate reader/writer lock that is taken during creation of the 
QuorumPeer and not released until it reaches a "fully configu
 red" state for the first time.)


---

Reply via email to