Github user mkedwards commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/707#discussion_r235460740
--- Diff:
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java
---
@@ -1556,32 +1573,50 @@ public String getNextDynamicConfigFilename() {
}
public void setLastSeenQuorumVerifier(QuorumVerifier qv, boolean
writeToDisk){
- synchronized (QV_LOCK) {
- if (lastSeenQuorumVerifier != null &&
lastSeenQuorumVerifier.getVersion() > qv.getVersion()) {
- LOG.error("setLastSeenQuorumVerifier called with stale
config " + qv.getVersion() +
- ". Current version: " +
quorumVerifier.getVersion());
+ // If qcm is non-null, we may call qcm.connectOne(), which will
take the lock on qcm
+ // and then take QV_LOCK. Take the locks in the same order to
ensure that we don't
+ // deadlock against other callers of connectOne(). If qcmRef gets
set in another
+ // thread while we're inside the synchronized block, that does no
harm; if we didn't
+ // take a lock on qcm (because it was null when we sampled it), we
won't call
+ // connectOne() on it. (Use of an AtomicReference is enough to
guarantee visibility
+ // of updates that provably happen in another thread before
entering this method.)
+ QuorumCnxManager qcm = qcmRef.get();
+ Object outerLockObject = (qcm != null) ? qcm : QV_LOCK;
--- End diff --
Yes; I'd like to get confirmation from @castuardo and his team that this
fully solves the lock nesting problem. But if you look at the flow of the
code, I think you'll see that the only way _this_ code can take the lock on a
{{QuorumCnxManager}} is via {{qcm.connectOne()}}. Since Java locks are
reentrant, we're safe ensuring that the nesting is always qcm -> QV_LOCK -> qcm
-> QV_LOCK if qcm is non-null and QV_LOCK -> QV_LOCK if qcm is null.
---