[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345456#comment-15345456 ]
Michael Han commented on ZOOKEEPER-2080:
----------------------------------------

The root cause of FLE shutdown never returning is a deadlock introduced as part of ZOOKEEPER-107. The deadlock happens between the WorkerReceiver thread of the Messenger in FastLeaderElection and the Listener thread in QuorumCnxManager when FastLeaderElection requests a restart of leader election as part of a dynamic reconfiguration change. An example:

# FastLeaderElection requests a [restart of leader election|https://github.com/apache/zookeeper/blob/ec056d3c3a18b862d0cd83296b7d4319652b0b1c/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L303]. Note that this block is synchronized on the QuorumPeer object self.
# Restarting leader election requires shutting down the existing QuorumCnxManager first, which requires [waiting for the listener thread to finish execution|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L539].
# At the same time, the listener thread could be in a state where it is [initiating new outgoing connections|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L355].
# While in that state, the listener thread can run into an invocation of connectOne, which is [synchronized on the same QuorumPeer object|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L475] that FLE shutdown acquired earlier.
# As a result, FastLeaderElection is waiting for the listener thread to finish while the listener thread is waiting for FLE to release the intrinsic lock on QuorumPeer, thus a deadlock. (A simplified sketch of this lock interaction is included at the end of this message.)

The code path that triggers the deadlock was introduced in ZOOKEEPER-107, so this issue only impacts 3.5 and not 3.4.

I am attaching a patch that fixes the issue by specifying a timeout value when joining the listener thread. I am not super satisfied with this fix, as relying on a timeout is fragile, but it does fix the problem (validated: all tests passed with my endurance test suite), and the side effect of bailing out seems trivial, since the listener thread is going to die anyway and bailing out does not leak any resources. I am going to dig deeper into the reconfig logic to see if there is a way to fix the deadlock that is better than bailing out on the listener's side. Meanwhile, this harmless patch is ready to go in if we need a quick and dirty way of fixing the problem.

Also attaching a thread dump that shows the deadlock.

> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
>                 Key: ZOOKEEPER-2080
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Ted Yu
>            Assignee: Michael Han
>         Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
> 	FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
> 	at org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
> 	at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
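For illustration, here is a minimal, self-contained sketch of the lock-ordering problem and of the timeout-bounded join described in the comment above. All names here (DeadlockSketch, self, connectOne, restartLeaderElection) and the timeout value are simplified stand-ins for this sketch, not the actual ZooKeeper classes or the attached patch.

{code}
// Minimal sketch of the deadlock pattern described above.
// All names are illustrative stand-ins, not the real ZooKeeper code.
public class DeadlockSketch {

    // Stands in for the QuorumPeer instance whose intrinsic lock both sides use.
    private final Object self = new Object();
    private volatile boolean running = true;

    // Stands in for QuorumCnxManager's Listener thread: it periodically
    // initiates outgoing connections via a method synchronized on 'self'.
    private final Thread listener = new Thread(() -> {
        while (running) {
            connectOne();                 // blocks if shutdown already holds 'self'
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                return;
            }
        }
    }, "Listener");

    private void connectOne() {
        synchronized (self) {
            // initiate an outgoing connection ...
        }
    }

    // Stands in for FLE shutdown / leader-election restart: it holds the lock
    // on 'self' and then joins the listener. If the listener is currently
    // blocked entering connectOne(), an unbounded join() never returns.
    public void restartLeaderElection() throws InterruptedException {
        synchronized (self) {
            running = false;
            listener.interrupt();         // has no effect on a thread blocked on a monitor
            // listener.join();           // unbounded join: can hang forever here
            listener.join(5000);          // bounding the wait lets shutdown proceed; the
                                          // listener exits once the lock on 'self' is freed
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DeadlockSketch sketch = new DeadlockSketch();
        sketch.listener.start();
        Thread.sleep(50);                 // let the listener run a few iterations
        sketch.restartLeaderElection();
        System.out.println("shutdown returned");
    }
}
{code}

With the unbounded join() variant, the program can hang whenever restartLeaderElection() acquires the lock while the listener is blocked entering connectOne(); with the bounded join, shutdown returns after at most the timeout and the listener dies on its own once the lock is released, which is the behavior the comment above relies on.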