[ https://issues.apache.org/jira/browse/ZOOKEEPER-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342476#comment-15342476 ]
Chris Nauroth commented on ZOOKEEPER-2380:
------------------------------------------
The bug is in the code path that handles loss of quorum. Unfortunately, if the
server enters the FOLLOWING state, the test run has not actually exercised that
code path. It would be great if this test could be made to cover the fixed
code path predictably. Maybe it would help if the stubbed code dropped more
packet types than just PING?
> Deadlock between leader shutdown and forwarding ACK to the leader
> -----------------------------------------------------------------
>
> Key: ZOOKEEPER-2380
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2380
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Reporter: Arshad Mohammad
> Assignee: Arshad Mohammad
> Priority: Blocker
> Fix For: 3.5.2, 3.6.0
>
> Attachments: ZOOKEEPER-2380-01.patch, ZOOKEEPER-2380-02.patch,
> ZOOKEEPER-2380-03.patch, ZOOKEEPER-2380-04.patch, ZOOKEEPER-2380-05.patch,
> ZOOKEEPER-2380-06.patch, ZOOKEEPER-2380-fail.out
>
>
> ZooKeeper deadlocks while shutting itself down. Because the deadlocked
> server is the leader, the ZooKeeper service becomes unavailable. Here is the
> thread dump:
> {code}
> "QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled)" #25 prio=5 os_prio=0 tid=0x00007fbc502a6800 nid=0x834 in Object.wait() [0x00007fbc4d9a8000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1245)
>         - locked <0x00000000feb78000> (a org.apache.zookeeper.server.SyncRequestProcessor)
>         at java.lang.Thread.join(Thread.java:1319)
>         at org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:196)
>         at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.shutdown(ProposalRequestProcessor.java:90)
>         at org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:1016)
>         at org.apache.zookeeper.server.quorum.LeaderRequestProcessor.shutdown(LeaderRequestProcessor.java:78)
>         at org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:561)
>         - locked <0x00000000feb61e20> (a org.apache.zookeeper.server.quorum.LeaderZooKeeperServer)
>         at org.apache.zookeeper.server.quorum.QuorumZooKeeperServer.shutdown(QuorumZooKeeperServer.java:169)
>         - locked <0x00000000feb61e20> (a org.apache.zookeeper.server.quorum.LeaderZooKeeperServer)
>         at org.apache.zookeeper.server.quorum.LeaderZooKeeperServer.shutdown(LeaderZooKeeperServer.java:102)
>         - locked <0x00000000feb61e20> (a org.apache.zookeeper.server.quorum.LeaderZooKeeperServer)
>         at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:637)
>         at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:590)
>         - locked <0x00000000feb781a0> (a org.apache.zookeeper.server.quorum.Leader)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1108)
>
> "SyncThread:1" #46 prio=5 os_prio=0 tid=0x00007fbc5848f000 nid=0x867 waiting for monitor entry [0x00007fbc4ca90000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.zookeeper.server.quorum.Leader.processAck(Leader.java:784)
>         - waiting to lock <0x00000000feb781a0> (a org.apache.zookeeper.server.quorum.Leader)
>         at org.apache.zookeeper.server.quorum.AckRequestProcessor.processRequest(AckRequestProcessor.java:46)
>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:183)
>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:113)
> {code}
> Leader.lead() calls shutdown() from inside a synchronized block, so the
> calling thread holds the lock on the Leader instance:
> {code}
> while (true) {
>     synchronized (this) {
>         long start = Time.currentElapsedTime();
>         .....
> {code}
> During the shutdown flow, SyncThread tries to acquire the lock on the same
> Leader instance (in Leader.processAck()).
> The leader thread holds the lock and waits for SyncThread to shut down, while
> SyncThread waits for that lock so it can finish its shutdown. This is how
> ZooKeeper entered the deadlock.
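
The lock-then-join pattern described in the quoted report can be reproduced
outside ZooKeeper. The following is a minimal sketch, not ZooKeeper code: the
class JoinUnderLockDemo, the leaderLock object, and both method names are
hypothetical stand-ins for the Leader monitor and the SyncThread. The join is
bounded by a timeout only so that the demo terminates instead of hanging as
the real server did. The second method shows one common remedy (release the
monitor before joining); it is not necessarily what the attached patches do.

```java
import java.util.concurrent.CountDownLatch;

public class JoinUnderLockDemo {
    // Hypothetical stand-in for the Leader instance that both threads synchronize on.
    private final Object leaderLock = new Object();

    // Worker modeled on SyncThread: before it can exit, it must enter the shared monitor once.
    private Thread startWorker(CountDownLatch go) {
        Thread worker = new Thread(() -> {
            try {
                go.await(); // wait until the caller already holds leaderLock
            } catch (InterruptedException e) {
                return;
            }
            synchronized (leaderLock) {
                // stands in for the final Leader.processAck() call
            }
        }, "sync-worker");
        worker.start();
        return worker;
    }

    /** Joins the worker while holding the lock; returns true if the worker is still alive after the timeout. */
    public boolean joinWhileHoldingLock(long timeoutMillis) throws InterruptedException {
        CountDownLatch go = new CountDownLatch(1);
        Thread worker = startWorker(go);
        boolean stuck;
        synchronized (leaderLock) {
            go.countDown();
            // Thread.join waits on the worker's own monitor, so leaderLock stays held here.
            worker.join(timeoutMillis); // must be > 0; an unbounded join would never return
            stuck = worker.isAlive();
        }
        worker.join(); // lock released, so the worker can now finish
        return stuck;
    }

    /** Releases the lock before joining; returns true if the worker is still alive after the timeout. */
    public boolean joinAfterReleasingLock(long timeoutMillis) throws InterruptedException {
        CountDownLatch go = new CountDownLatch(1);
        Thread worker = startWorker(go);
        synchronized (leaderLock) {
            go.countDown();
            // decide to shut down here, but do not wait for the worker under the lock
        }
        worker.join(timeoutMillis);
        return worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        JoinUnderLockDemo demo = new JoinUnderLockDemo();
        System.out.println("join under lock, worker stuck: " + demo.joinWhileHoldingLock(200));
        System.out.println("join after release, worker stuck: " + demo.joinAfterReleasingLock(5000));
    }
}
```

In the first method the worker cannot finish before the timeout because the
caller never gives up the monitor it needs, which matches the WAITING and
BLOCKED pair in the thread dump. In the second method the worker acquires the
monitor as soon as the caller leaves the synchronized block, so the join
returns normally.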