[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517870#comment-15517870 ]
Michael Han commented on ZOOKEEPER-2080: ---------------------------------------- Thanks a ton for your feedback Alex, and Flavio. Had more thoughts about this today. First to reply your comments posted earlier: bq. If we pass the QV instance as part of the connectOne call, then can we remove the synchronized(self) block? To pass QV instances we have to first retrieve them from a QuorumPeer object. Get QV instances on a QuorumPeer object requires synchronization on the QuorumPeer itself, either using existing methods or using what Alex was suggesting: bq. To remove the synchronized block completely you could create a synchronized function in QuorumPeer that returns a pair So in any case we will still possibly end up with a deadlock somewhere inside listener threads - we just moved the deadlock from one place to the other :) - because somewhere in the listener thread of QuorumCnxManager we are going to invoke synchronized version of getQV, which might block because the same QuorumPeer object which the getQV methods are invoking were acquired as a lock during restart of FLE. To recap the deadlock problem - a simplified version description: * Restart leader election acquires a QuorumPeer lock. * Restart leader election requires shut down QuorumCnxManager as part of the process while holding the lock. * Shutdown QuorumCnxManager requires synchronize on the same QuorumPeer. The solution for each of these in order: * Don't acquire a QuorumPeer lock when restart LE. This is less likely a solution due to the state changes to QuorumPeer during the process required the process has to be synchronized. * Don't hold the lock while shutdown QuorumCnxManager. This is essentially the latest patch did by moving the shutdown out of the sync block. However, it does not fix the issue I spotted in previous comments (potential dead lock in QuorumPeer.restartLeaderElection - because this method synchronized implicitly on the intrinsic lock). * Don't synchronize on the QuorumPeer inside QuorumCnxManager's shutdown process: this implies we can't involve any of the QuorumPeer's synchronized methods in listener threads run loop - otherwise we will end up a deadlock. As previously commented, removing {{synchronized(self)}} block by passing QV instance to {{connectOne}} would not work because we still need to invoke the synchronized version of getQA somewhere in listener threads. Other thoughts - we could possibly interrupt the connectOne which is synchronizing on the QuorumPeer. It should be fine to interrupt because at this point everything is shutting down and the server socket was already closed (in listener.halt) so no resource leak concerns. So we just need to find a way that after we call {{listener.halt()}} in {{QuorumCnxManager.halt()}}, we can signal the listener threads to bail out the blocking state - I need to think more about how to do this because it seems not possible to interrupt a synchronized code block - so we may have to use additional abstractions (i.e. ReentrantLock?) which adds more complexity. > ReconfigRecoveryTest fails intermittently > ----------------------------------------- > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)