Github user mkedwards commented on the issue:
https://github.com/apache/zookeeper/pull/707
The deadlock, as I understand it, comes from an attempt to take `QV_LOCK`
in order to access `myElectionAddress` in thread A, while holding the lock on
the `QuorumCnxManager` object; meanwhile, thread B, which holds the `QV_LOCK`
while `connectOne()` is running, tries to take the lock on the
`QuorumCnxManager` object, producing a deadly embrace. From the Jira:
```
[junit] java.lang.Thread.State: BLOCKED
[junit] at
org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
[junit] at
org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
[junit] at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
[junit] at
org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
[junit] at
org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
[junit]
[junit] java.lang.Thread.State: BLOCKED
[junit] at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
[junit] at
org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
[junit] at
org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
[junit] at
org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
[junit] at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
[junit] at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
```
This patch removes the synchronization on `QV_LOCK` from the address
accessor paths, where its current effect seems primarily to ensure cross-thread
visibility; the memory barriers involved in setting/fetching through an
`AtomicReference` are equally effective at this. Wrapping a single
`AtomicReference<AddressTuple>` around all three addresses ensures that you
don't get interleaved sets when two threads each try to update the address
fields -- although in the code as I read it, callers are holding the `QV_LOCK`
anyway. I can't yet say that I've analyzed all other paths that take
`QV_LOCK`, but I'm fairly sure that if
https://github.com/apache/zookeeper/pull/247 would work, this would work at
least as well, while at least partially addressing concerns about getting
"outdated" state in read paths. (If this is a serious concern, then presumably
we need a separate reader/writer lock that is taken during creation of the
QuorumPeer and not released until it reaches a "fully configu
red" state for the first time.)
---