[jira] [Comment Edited] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2017-09-06 Thread Cesar Stuardo (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156297#comment-16156297
 ] 

Cesar Stuardo edited comment on ZOOKEEPER-2778 at 9/7/17 1:42 AM:
--

Hey,

Happy to help [~hanm]! Are we correct about the issue (regarding the path)?


was (Author: castuardo):
Hey,

Happy to help! Are we correct about the issue (regarding the path)?

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Assignee: Michael Han
>Priority: Critical
>
> It's possible to have a deadlock during the recovery phase. Found this issue 
> by analyzing thread dumps of the "flaky" ReconfigRecoveryTest [1]. Here is a 
> sample thread dump that illustrates the state of the execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread, which runs the follower 
> doing the sync-with-leader work, and the listener of the qcm of the same 
> quorum peer, which does the connection-receiving work. Basically, to finish 
> syncing with the leader, the follower needs to synchronize on both QV_LOCK 
> and the qcm object it owns, while in the receiver thread, to finish setting up 
> an incoming connection, the thread needs to synchronize on both the qcm object 
> the quorum peer owns and the same QV_LOCK. It's easy to see that the problem 
> here is that the two locks are acquired in different orders; thus, depending 
> on timing / actual execution order, the two threads might end up each 
> acquiring one lock while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig





[jira] [Commented] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2017-09-06 Thread Cesar Stuardo (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156297#comment-16156297
 ] 

Cesar Stuardo commented on ZOOKEEPER-2778:
--

Hey,

Happy to help! Are we correct about the issue (regarding the path)?

> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Assignee: Michael Han
>Priority: Critical
>
> It's possible to have a deadlock during the recovery phase. Found this issue 
> by analyzing thread dumps of the "flaky" ReconfigRecoveryTest [1]. Here is a 
> sample thread dump that illustrates the state of the execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread, which runs the follower 
> doing the sync-with-leader work, and the listener of the qcm of the same 
> quorum peer, which does the connection-receiving work. Basically, to finish 
> syncing with the leader, the follower needs to synchronize on both QV_LOCK 
> and the qcm object it owns, while in the receiver thread, to finish setting up 
> an incoming connection, the thread needs to synchronize on both the qcm object 
> the quorum peer owns and the same QV_LOCK. It's easy to see that the problem 
> here is that the two locks are acquired in different orders; thus, depending 
> on timing / actual execution order, the two threads might end up each 
> acquiring one lock while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig





[jira] [Updated] (ZOOKEEPER-2888) Reconfig Command Isolates One of the Nodes when All Ports Change

2017-09-01 Thread Cesar Stuardo (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cesar Stuardo updated ZOOKEEPER-2888:
-
Description: 
When we run our Distributed system Model Checking (DMCK) tool on ZooKeeper v3.5.3 
with the following workload (complete details attached):

1. start a 5-node cluster (all nodes know each other).
2. wait for the cluster to reach a steady state.
3. issue a reconfig command that does not add or remove nodes but changes all 
the ports of the existing cluster (no role change either); a sketch of such a 
call is given after this description.

We observe that in some situations one of the followers may end up isolated, 
since the other nodes change their ports and end up setting up new connections. 
The consequence is similar to the one at 
[ZK-2865|https://issues.apache.org/jira/browse/ZOOKEEPER-2865?jql=] but the 
scenario is different.

We provide further details in the attached document.
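
For concreteness, here is a minimal sketch of what step 3 could look like through 
the 3.5.x ZooKeeperAdmin client, assuming reconfig is enabled on the ensemble and 
a non-incremental reconfig that keeps the same five server ids and hosts but moves 
every quorum, election and client port. The connect string, hostnames and port 
numbers below are made up for illustration and are not the ones used in the 
attached document.

{code:java}
import org.apache.zookeeper.admin.ZooKeeperAdmin;
import org.apache.zookeeper.data.Stat;

public class ReconfigAllPortsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connect string; point it at any member of the running ensemble.
        ZooKeeperAdmin admin = new ZooKeeperAdmin("127.0.0.1:2181", 30000, event -> { });

        // New membership spec: the same five server ids and hosts as before, but
        // every quorum, election and client port differs from the running config.
        String newMembers = String.join(",",
                "server.1=127.0.0.1:3111:3112:participant;3181",
                "server.2=127.0.0.1:3211:3212:participant;3281",
                "server.3=127.0.0.1:3311:3312:participant;3381",
                "server.4=127.0.0.1:3411:3412:participant;3481",
                "server.5=127.0.0.1:3511:3512:participant;3581");

        Stat stat = new Stat();
        // Non-incremental (bulk) reconfig; -1 skips the expected-config-version check.
        byte[] newConfig = admin.reconfigure(null /* joining */, null /* leaving */,
                newMembers, -1, stat);
        System.out.println("new config:\n" + new String(newConfig));
        admin.close();
    }
}
{code}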



  was:
When we run our Distributed system Model Checking (DMCK) tool on ZooKeeper v3.5.3
with the following workload (complete details attached):
1. start a 5-node cluster (all nodes know each other).
2. wait for the cluster to reach a steady state.
3. issue a reconfig command that does not add or remove nodes but changes all 
the ports of the existing cluster (no role change either). 

We observe that in some situations one of the followers may end up isolated, 
since the other nodes change their ports and end up setting up new connections. 
The consequence is similar to the one at 
[ZK-2865|https://issues.apache.org/jira/browse/ZOOKEEPER-2865?jql=] but the 
scenario is different.

We provide further details in the attached document.




> Reconfig Command Isolates One of the Nodes when All Ports Change
> 
>
> Key: ZOOKEEPER-2888
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2888
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Cesar Stuardo
> Attachments: ZK-2888.pdf
>
>
> When we run our Distributed system Model Checking (DMCK) tool on ZooKeeper v3.5.3 
> with the following workload (complete details attached):
> 1. start a 5-node cluster (all nodes know each other).
> 2. wait for the cluster to reach a steady state.
> 3. issue a reconfig command that does not add or remove nodes but changes all 
> the ports of the existing cluster (no role change either). 
> We observe that in some situations one of the followers may end up isolated, 
> since the other nodes change their ports and end up setting up new 
> connections. The consequence is similar to the one at 
> [ZK-2865|https://issues.apache.org/jira/browse/ZOOKEEPER-2865?jql=] but the 
> scenario is different.
> We provide further details in the attached document.





[jira] [Updated] (ZOOKEEPER-2888) Reconfig Command Isolates One of the Nodes when All Ports Change

2017-09-01 Thread Cesar Stuardo (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cesar Stuardo updated ZOOKEEPER-2888:
-
Summary: Reconfig Command Isolates One of the Nodes when All Ports Change  
(was: Reconfig Command Isolates One of the Nodes when all ports change)

> Reconfig Command Isolates One of the Nodes when All Ports Change
> 
>
> Key: ZOOKEEPER-2888
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2888
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Cesar Stuardo
> Attachments: ZK-2888.pdf
>
>
> When we run our Distributed system Model Checking (DMCK) tool on ZooKeeper v3.5.3
> with the following workload (complete details attached):
> 1. start a 5-node cluster (all nodes know each other).
> 2. wait for the cluster to reach a steady state.
> 3. issue a reconfig command that does not add or remove nodes but changes all 
> the ports of the existing cluster (no role change either). 
> We observe that in some situations one of the followers may end up isolated, 
> since the other nodes change their ports and end up setting up new 
> connections. The consequence is similar to the one at 
> [ZK-2865|https://issues.apache.org/jira/browse/ZOOKEEPER-2865?jql=] but the 
> scenario is different.
> We provide further details in the attached document.





[jira] [Updated] (ZOOKEEPER-2888) Reconfig Command Isolates One of the Nodes when all ports change

2017-09-01 Thread Cesar Stuardo (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cesar Stuardo updated ZOOKEEPER-2888:
-
Summary: Reconfig Command Isolates One of the Nodes when all ports change  
(was: Reconfig Causes Inconsistent Configuration file among the nodes)

> Reconfig Command Isolates One of the Nodes when all ports change
> 
>
> Key: ZOOKEEPER-2888
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2888
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Cesar Stuardo
> Attachments: ZK-2888.pdf
>
>
> When we run our Distributed system Model Checking (DMCK) tool on ZooKeeper v3.5.3
> with the following workload (complete details attached):
> 1. start a 5-node cluster (all nodes know each other).
> 2. wait for the cluster to reach a steady state.
> 3. issue a reconfig command that does not add or remove nodes but changes all 
> the ports of the existing cluster (no role change either). 
> We observe that in some situations one of the followers may end up isolated, 
> since the other nodes change their ports and end up setting up new 
> connections. The consequence is similar to the one at 
> [ZK-2865|https://issues.apache.org/jira/browse/ZOOKEEPER-2865?jql=] but the 
> scenario is different.
> We provide further details in the attached document.





[jira] [Updated] (ZOOKEEPER-2888) Reconfig Causes Inconsistent Configuration file among the nodes

2017-09-01 Thread Cesar Stuardo (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cesar Stuardo updated ZOOKEEPER-2888:
-
Attachment: ZK-2888.pdf

> Reconfig Causes Inconsistent Configuration file among the nodes
> ---
>
> Key: ZOOKEEPER-2888
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2888
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Cesar Stuardo
> Attachments: ZK-2888.pdf
>
>
> When we run our Distributed system Model Checking (DMCK) tool on ZooKeeper v3.5.3
> with the following workload (complete details attached):
> 1. start a 5-node cluster (all nodes know each other).
> 2. wait for the cluster to reach a steady state.
> 3. issue a reconfig command that does not add or remove nodes but changes all 
> the ports of the existing cluster (no role change either). 
> We observe that in some situations one of the followers may end up isolated, 
> since the other nodes change their ports and end up setting up new 
> connections. The consequence is similar to the one at 
> [ZK-2865|https://issues.apache.org/jira/browse/ZOOKEEPER-2865?jql=] but the 
> scenario is different.
> We provide further details in the attached document.





[jira] [Created] (ZOOKEEPER-2888) Reconfig Causes Inconsistent Configuration file among the nodes

2017-09-01 Thread Cesar Stuardo (JIRA)
Cesar Stuardo created ZOOKEEPER-2888:


 Summary: Reconfig Causes Inconsistent Configuration file among the 
nodes
 Key: ZOOKEEPER-2888
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2888
 Project: ZooKeeper
  Issue Type: Bug
  Components: quorum
Affects Versions: 3.5.3
Reporter: Cesar Stuardo


When we run our Distributed system Model Checking (DMCK) tool on ZooKeeper v3.5.3
with the following workload (complete details attached):
1. start a 5-node cluster (all nodes know each other).
2. wait for the cluster to reach a steady state.
3. issue a reconfig command that does not add or remove nodes but changes all 
the ports of the existing cluster (no role change either). 

We observe that in some situations one of the followers may end up isolated, 
since the other nodes change their ports and end up setting up new connections. 
The consequence is similar to the one at 
[ZK-2865|https://issues.apache.org/jira/browse/ZOOKEEPER-2865?jql=] but the 
scenario is different.

We provide further details in the attached document.







[jira] [Commented] (ZOOKEEPER-2865) Reconfig Causes Inconsistent Configuration file among the nodes

2017-09-01 Thread Cesar Stuardo (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151051#comment-16151051
 ] 

Cesar Stuardo commented on ZOOKEEPER-2865:
--

Hello Alexander,

In your first comment, you state that:

{quote}
But what's required is for the cluster to be able to recover from this state - 
the server that didn't get the commit in your scenario should find out about 
the new config and eventually join the cluster. If that doesn't happen then 
that potentially is a bug, but its not clear from the description here.
{quote}


What do you mean by this? In our scenario, the node won't be able to recover, 
since the nodes it knows about at startup are no longer listening on the same 
ports, and thus it won't get updated. The only solution is admin intervention.


> Reconfig Causes Inconsistent Configuration file among the nodes
> ---
>
> Key: ZOOKEEPER-2865
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2865
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 3.5.3
>Reporter: Jeffrey F. Lukman
>Assignee: Alexander Shraer
>Priority: Trivial
> Fix For: 3.5.4, 3.6.0
>
> Attachments: ZK-2865.pdf
>
>
> When we run our Distributed system Model Checking (DMCK) tool on ZooKeeper v3.5.3
> by following the workload in ZK-2778:
> - initially start 2 ZooKeeper nodes
> - start 3 new nodes
> - do a reconfiguration (the complete reconfiguration is attached in the 
> document)
> We think our DMCK found the following bug:
> - while one of the newly joined nodes (call it node X) has not yet received 
> the latest configuration update, the initial leader node closes its port, 
> therefore causing node X to be isolated.
> For complete information about the bug, please see the attached document.





[jira] [Comment Edited] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2017-09-01 Thread Cesar Stuardo (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151039#comment-16151039
 ] 

Cesar Stuardo edited comment on ZOOKEEPER-2778 at 9/1/17 7:16 PM:
--

Hello,

We are computer science PhD students building a distributed model checking 
tool. We have been able to reproduce this issue too: the QuorumPeer thread 
races with the Listener thread and gets into a deadlock by requesting the 
QCNXManager lock while holding QV_LOCK (on the other side, the Listener thread 
holds the QCNXManager lock and requests QV_LOCK). We also see this potential 
issue with the WorkerSender thread while performing toSend -> connectOne (one 
argument, requesting QCNXManager_LOCK) -> connectOne (two arguments, requesting 
QCNXManager_LOCK) -> initiateConnection -> getElectionAddress (requesting 
QV_LOCK), which can also race with the QuorumPeer thread for the same locks.
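
To make the inverted lock ordering easier to see, here is a minimal, 
self-contained sketch (not ZooKeeper code): two plain monitors stand in for 
QV_LOCK and the QuorumCnxManager instance, and the two runnables mimic the 
follower sync path and the Listener's receiveConnection path from the stack 
traces above. With the artificial sleeps, the two threads almost always end up 
each holding one lock while waiting for the other.

{code:java}
// Minimal sketch of the reported lock-ordering problem (not ZooKeeper code).
public class LockOrderDeadlockSketch {

    private static final Object QV_LOCK = new Object(); // ~ QuorumPeer's QV_LOCK
    private static final Object QCM = new Object();     // ~ QuorumCnxManager instance

    public static void main(String[] args) {
        // Follower sync path: setLastSeenQuorumVerifier -> connectNewPeers -> connectOne
        // takes QV_LOCK first, then needs the qcm monitor.
        Thread quorumPeer = new Thread(() -> {
            synchronized (QV_LOCK) {
                pause();
                synchronized (QCM) {
                    System.out.println("QuorumPeer: acquired both locks");
                }
            }
        }, "QuorumPeer");

        // Listener path: receiveConnection -> connectOne -> initiateConnection
        // -> getElectionAddress takes the qcm monitor first, then needs QV_LOCK.
        Thread listener = new Thread(() -> {
            synchronized (QCM) {
                pause();
                synchronized (QV_LOCK) {
                    System.out.println("Listener: acquired both locks");
                }
            }
        }, "Listener");

        quorumPeer.start();
        listener.start();
        // Typically neither message is printed: each thread holds one lock and
        // blocks forever waiting for the other, i.e. a classic ABBA deadlock.
    }

    private static void pause() {
        try {
            Thread.sleep(100); // widen the race window
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}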




was (Author: castuardo):
Hello,

We are computer science PhD students building a distributed model checking 
tool. We have been able to reproduce this issue too: the QuorumPeer thread 
races with the Listener thread and gets into a deadlock by requesting the 
QCNXManager lock while holding QV_LOCK (on the other side, the Listener thread 
holds the QCNXManager lock and requests QV_LOCK). We also see this potential 
issue with the WorkerSender thread while performing toSend -> connectOne (one 
argument, requesting QCNXManager_LOCK) -> connectAll -> connectOne (two 
arguments, requesting QCNXManager_LOCK) -> initiateConnection -> 
getElectionAddress (requesting QV_LOCK), which can also race with the QuorumPeer 
thread for the same locks.



> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Assignee: Michael Han
>Priority: Critical
>
> It's possible to have a deadlock during the recovery phase. Found this issue 
> by analyzing thread dumps of the "flaky" ReconfigRecoveryTest [1]. Here is a 
> sample thread dump that illustrates the state of the execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread, which runs the follower 
> doing the sync-with-leader work, and the listener of the qcm of the same 
> quorum peer, which does the connection-receiving work. Basically, to finish 
> syncing with the leader, the follower needs to synchronize on both QV_LOCK 
> and the qcm object it owns, while in the receiver thread, to finish setting up 
> an incoming connection, the thread needs to synchronize on both the qcm object 
> the quorum peer owns and the same QV_LOCK. It's easy to see that the problem 
> here is that the two locks are acquired in different orders; thus, depending 
> on timing / actual execution order, the two threads might end up each 
> acquiring one lock while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig





[jira] [Comment Edited] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2017-09-01 Thread Cesar Stuardo (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151039#comment-16151039
 ] 

Cesar Stuardo edited comment on ZOOKEEPER-2778 at 9/1/17 7:16 PM:
--

Hello,

We are computer science PhD students building a distributed model checking 
tool. We have been able to reproduce this issue too: the QuorumPeer thread 
races with the Listener thread and gets into a deadlock by requesting the 
QCNXManager lock while holding QV_LOCK (on the other side, the Listener thread 
holds the QCNXManager lock and requests QV_LOCK). We also see this potential 
issue with the WorkerSender thread while performing toSend -> connectOne (one 
argument, requesting QCNXManager_LOCK) -> connectAll -> connectOne (two 
arguments, requesting QCNXManager_LOCK) -> initiateConnection -> 
getElectionAddress (requesting QV_LOCK), which can also race with the QuorumPeer 
thread for the same locks.




was (Author: castuardo):
Hello,

We are computer science PhD students building a distributed model checking 
tool. We have been able to reproduce this issue too: the QuorumPeer thread 
races with the Listener thread and gets into a deadlock by requesting the 
QCNXManager lock while holding QV_LOCK (on the other side, the Listener thread 
holds the QCNXManager lock and requests QV_LOCK). We also see this potential 
issue with the WorkerSender thread while performing toSend -> connectOne (one 
argument, requesting QCNXManager_LOCK) -> connectOne (two arguments, requesting 
QCNXManager_LOCK) -> initiateConnection -> getElectionAddress (requesting 
QV_LOCK), which can also race with the QuorumPeer thread for the same locks.



> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Assignee: Michael Han
>Priority: Critical
>
> It's possible to have a deadlock during the recovery phase. Found this issue 
> by analyzing thread dumps of the "flaky" ReconfigRecoveryTest [1]. Here is a 
> sample thread dump that illustrates the state of the execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread, which runs the follower 
> doing the sync-with-leader work, and the listener of the qcm of the same 
> quorum peer, which does the connection-receiving work. Basically, to finish 
> syncing with the leader, the follower needs to synchronize on both QV_LOCK 
> and the qcm object it owns, while in the receiver thread, to finish setting up 
> an incoming connection, the thread needs to synchronize on both the qcm object 
> the quorum peer owns and the same QV_LOCK. It's easy to see that the problem 
> here is that the two locks are acquired in different orders; thus, depending 
> on timing / actual execution order, the two threads might end up each 
> acquiring one lock while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig





[jira] [Commented] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2017-09-01 Thread Cesar Stuardo (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151039#comment-16151039
 ] 

Cesar Stuardo commented on ZOOKEEPER-2778:
--

Hello,

We are computer science PhD students building a distributed model checking 
tool. We have been able to reproduce this issue too: the QuorumPeer thread 
races with the Listener thread and gets into a deadlock by requesting the 
QCNXManager lock while holding QV_LOCK (on the other side, the Listener thread 
holds the QCNXManager lock and requests QV_LOCK). We also see this potential 
issue with the WorkerSender thread while performing toSend -> connectOne (one 
argument, requesting QCNXManager_LOCK) -> connectOne (two arguments, requesting 
QCNXManager_LOCK) -> initiateConnection -> getElectionAddress (requesting 
QV_LOCK), which can also race with the QuorumPeer thread for the same locks.



> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Assignee: Michael Han
>Priority: Critical
>
> It's possible to have a deadlock during the recovery phase. Found this issue 
> by analyzing thread dumps of the "flaky" ReconfigRecoveryTest [1]. Here is a 
> sample thread dump that illustrates the state of the execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread, which runs the follower 
> doing the sync-with-leader work, and the listener of the qcm of the same 
> quorum peer, which does the connection-receiving work. Basically, to finish 
> syncing with the leader, the follower needs to synchronize on both QV_LOCK 
> and the qcm object it owns, while in the receiver thread, to finish setting up 
> an incoming connection, the thread needs to synchronize on both the qcm object 
> the quorum peer owns and the same QV_LOCK. It's easy to see that the problem 
> here is that the two locks are acquired in different orders; thus, depending 
> on timing / actual execution order, the two threads might end up each 
> acquiring one lock while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig





[jira] [Comment Edited] (ZOOKEEPER-2778) Potential server deadlock between follower sync with leader and follower receiving external connection requests.

2017-09-01 Thread Cesar Stuardo (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151039#comment-16151039
 ] 

Cesar Stuardo edited comment on ZOOKEEPER-2778 at 9/1/17 7:15 PM:
--

Hello,

We are computer science PhD students building a distributed model checking 
tool. We have been able to reproduce this issue too: the QuorumPeer thread 
races with the Listener thread and gets into a deadlock by requesting the 
QCNXManager lock while holding QV_LOCK (on the other side, the Listener thread 
holds the QCNXManager lock and requests QV_LOCK). We also see this potential 
issue with the WorkerSender thread while performing toSend -> connectOne (one 
argument, requesting QCNXManager_LOCK) -> connectOne (two arguments, requesting 
QCNXManager_LOCK) -> initiateConnection -> getElectionAddress (requesting 
QV_LOCK), which can also race with the QuorumPeer thread for the same locks.




was (Author: castuardo):
Hello,

We are computer science PhD students building a distributed model checking 
tool. We have been able to reproduce this issue too: the QuorumPeer thread 
races with the Listener thread and gets into a deadlock by requesting the 
QCNXManager lock while holding QV_LOCK (on the other side, the Listener thread 
holds the QCNXManager lock and requests QV_LOCK). We also see this potential 
issue with the WorkerSender thread while performing toSend -> connectOne (one 
argument, requesting QCNXManager_LOCK) -> connectOne (two arguments, requesting 
QCNXManager_LOCK) -> initiateConnection -> getElectionAddress (requesting 
QV_LOCK), which can also race with the QuorumPeer thread for the same locks.



> Potential server deadlock between follower sync with leader and follower 
> receiving external connection requests.
> 
>
> Key: ZOOKEEPER-2778
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.3
>Reporter: Michael Han
>Assignee: Michael Han
>Priority: Critical
>
> It's possible to have a deadlock during the recovery phase. Found this issue 
> by analyzing thread dumps of the "flaky" ReconfigRecoveryTest [1]. Here is a 
> sample thread dump that illustrates the state of the execution:
> {noformat}
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
> [junit] 
> [junit]  java.lang.Thread.State: BLOCKED
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
> [junit] at  
> org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
> [junit] at  
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
> [junit] at  
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {noformat}
> The deadlock happens between the quorum peer thread, which runs the follower 
> doing the sync-with-leader work, and the listener of the qcm of the same 
> quorum peer, which does the connection-receiving work. Basically, to finish 
> syncing with the leader, the follower needs to synchronize on both QV_LOCK 
> and the qcm object it owns, while in the receiver thread, to finish setting up 
> an incoming connection, the thread needs to synchronize on both the qcm object 
> the quorum peer owns and the same QV_LOCK. It's easy to see that the problem 
> here is that the two locks are acquired in different orders; thus, depending 
> on timing / actual execution order, the two threads might end up each 
> acquiring one lock while holding the other.
> [1] 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig


