[
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640957#comment-16640957
]
maoling edited comment on ZOOKEEPER-2778 at 10/7/18 3:02 AM:
-------------------------------------------------------------
[~hanm]
Are you still working on this?Could I pick up it(smirk)?
was (Author: maoling):
[~hanm]
Are still working on this?Could I pick up it?
> Potential server deadlock between follower sync with leader and follower
> receiving external connection requests.
> ----------------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.5.3
> Reporter: Michael Han
> Assignee: Michael Han
> Priority: Critical
>
> It's possible to have a deadlock during recovery phase.
> Found this issue by analyzing thread dumps of "flaky" ReconfigRecoveryTest
> [1]. . Here is a sample thread dump that illustrates the state of the
> execution:
> {noformat}
> [junit] java.lang.Thread.State: BLOCKED
> [junit] at
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit]
> [junit] java.lang.Thread.State: BLOCKED
> [junit] at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The dead lock happens between the quorum peer thread which running the
> follower that doing sync with leader work, and the listener of the qcm of the
> same quorum peer that doing the receiving connection work. Basically to
> finish sync with leader, the follower needs to synchronize on both QV_LOCK
> and the qmc object it owns; while in the receiver thread to finish setup an
> incoming connection the thread needs to synchronize on both the qcm object
> the quorum peer owns, and the same QV_LOCK. It's easy to see the problem here
> is the order of acquiring two locks are different, thus depends on timing /
> actual execution order, two threads might end up acquiring one lock while
> holding another.
> [1]
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)