[ https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151039#comment-16151039 ]

Cesar Stuardo edited comment on ZOOKEEPER-2778 at 9/1/17 7:16 PM:
------------------------------------------------------------------

Hello,

We are computer science PhD students building a distributed model checking 
tool. We have been able to reproduce this issue too: the QuorumPeer thread 
races with the Listener thread and gets into a deadlock by requesting the 
QCNXManager lock while holding QV_LOCK (on the other side, the Listener thread 
holds the QCNXManager lock and requests QV_LOCK). We also see a potential issue 
with the WorkerSender thread while performing toSend -> connectOne (one 
argument, requesting the QCNXManager lock) -> connectOne (two arguments, 
requesting the QCNXManager lock) -> initiateConnection -> getElectionAddress 
(requesting QV_LOCK), which can likewise race with the QuorumPeer thread for 
the same locks.
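To illustrate the pattern, here is a minimal, self-contained Java sketch (not ZooKeeper code; `qvLock` and `qcmLock` are stand-ins for QV_LOCK and the QuorumCnxManager monitor, and the barrier forces the interleaving that the race only sometimes produces). It reproduces the inverted lock-ordering deadlock and detects it with the JDK's ThreadMXBean:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CyclicBarrier;

public class LockOrderDeadlock {
    // Stand-ins for QV_LOCK and the QuorumCnxManager monitor.
    static final Object qvLock = new Object();
    static final Object qcmLock = new Object();
    // Forces both threads to hold their first lock before requesting the second.
    static final CyclicBarrier barrier = new CyclicBarrier(2);

    static void spawn(String name, Object first, Object second) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                try { barrier.await(); } catch (Exception ignored) { }
                synchronized (second) { /* never reached: deadlock */ }
            }
        }, name);
        t.setDaemon(true); // let the JVM exit despite the deadlocked threads
        t.start();
    }

    public static void main(String[] args) throws Exception {
        spawn("quorum-peer", qvLock, qcmLock); // QV_LOCK -> QCNXManager lock
        spawn("listener", qcmLock, qvLock);    // QCNXManager lock -> QV_LOCK
        Thread.sleep(1000);                    // give both threads time to block
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();
        System.out.println("Deadlocked threads: " + (ids == null ? 0 : ids.length));
    }
}
```

Running this reliably reports two deadlocked threads, which is the same wait-for cycle the thread dumps below show between the QuorumPeer and Listener threads.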




was (Author: castuardo):
Hello,

We are computer science PhD students building a distributed model checking 
tool. We have been able to reproduce this issue too: the QuorumPeer thread 
races with the Listener thread and gets into a deadlock by requesting the 
QCNXManager lock while holding QV_LOCK (on the other side, the Listener thread 
holds the QCNXManager lock and requests QV_LOCK). We also see a potential issue 
with the WorkerSender thread while performing toSend -> connectOne (one 
argument, requesting the QCNXManager lock) -> connectAll -> connectOne (two 
arguments, requesting the QCNXManager lock) -> initiateConnection -> 
getElectionAddress (requesting QV_LOCK), which can likewise race with the 
QuorumPeer thread for the same locks.



> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2778
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.5.3
>            Reporter: Michael Han
>            Assignee: Michael Han
>            Priority: Critical
>
> It's possible to have a deadlock during the recovery phase. 
> We found this issue by analyzing thread dumps of the "flaky" 
> ReconfigRecoveryTest [1]. Here is a sample thread dump that illustrates the 
> state of the execution:
> {noformat}
>     [junit]  java.lang.Thread.State: BLOCKED
>     [junit]         at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
>     [junit]         at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
>     [junit]         at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
>     [junit]         at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
>     [junit]         at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
>     [junit] 
>     [junit]  java.lang.Thread.State: BLOCKED
>     [junit]         at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
>     [junit]         at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
>     [junit]         at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
>     [junit]         at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
>     [junit]         at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
>     [junit]         at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread, which is running the 
> follower doing the sync-with-leader work, and the listener of the qcm of the 
> same quorum peer, which is doing the receive-connection work. Basically, to 
> finish syncing with the leader, the follower needs to synchronize on both 
> QV_LOCK and the qcm object it owns; while in the receiver thread, to finish 
> setting up an incoming connection, the thread needs to synchronize on both the 
> qcm object the quorum peer owns and the same QV_LOCK. The problem here is that 
> the two threads acquire the two locks in different orders; thus, depending on 
> timing / actual execution order, each thread might end up acquiring one lock 
> while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig
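The standard remedy for this class of bug (a sketch only, not the actual patch for this ticket; `qvLock`, `qcmLock`, and `withBothLocks` are hypothetical names) is to impose one global acquisition order on the two locks, so the wait-for cycle can never form:

```java
public class ConsistentLockOrder {
    // Stand-ins for QV_LOCK and the QuorumCnxManager monitor.
    static final Object qvLock = new Object();
    static final Object qcmLock = new Object();

    // Every code path that needs both locks goes through this helper,
    // so the order is always qvLock -> qcmLock and no cycle can form.
    static void withBothLocks(Runnable action) {
        synchronized (qvLock) {
            synchronized (qcmLock) {
                action.run();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Thread a = new Thread(() -> withBothLocks(() -> { }), "quorum-peer");
        Thread b = new Thread(() -> withBothLocks(() -> { }), "listener");
        a.start();
        b.start();
        a.join();
        b.join(); // both threads finish: no deadlock is possible
        System.out.println("completed without deadlock");
    }
}
```

With a consistent order, whichever thread takes `qvLock` first also gets `qcmLock` and completes, after which the other thread proceeds.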



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)