[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

Michael Han (JIRA) Fri, 23 Sep 2016 16:31:49 -0700

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517870#comment-15517870
 ]


Michael Han commented on ZOOKEEPER-2080:
----------------------------------------

Thanks a ton for your feedback Alex, and Flavio. Had more thoughts about this 
today. First to reply your comments posted earlier:

bq. If we pass the QV instance as part of the connectOne call, then can we 
remove the synchronized(self) block?
To pass QV instances we have to first retrieve them from a QuorumPeer object. 
Get QV instances on a QuorumPeer object requires synchronization on the 
QuorumPeer itself, either using existing methods or using what Alex was 
suggesting:
bq. To remove the synchronized block completely you could create a synchronized 
function in QuorumPeer that returns a pair

So in any case we will still possibly end up with a deadlock somewhere inside 
listener threads - we just moved the deadlock from one place to the other :) - 
because somewhere in the listener thread of QuorumCnxManager we are going to 
invoke synchronized version of getQV, which might block because the same 
QuorumPeer object which the getQV methods are invoking were acquired as a lock 
during restart of FLE.

To recap the deadlock problem - a simplified version description:
* Restart leader election acquires a QuorumPeer lock.
* Restart leader election requires shut down QuorumCnxManager as part of the 
process while holding the lock.
* Shutdown QuorumCnxManager requires synchronize on the same QuorumPeer.

The solution for each of these in order:
* Don't acquire a QuorumPeer lock when restart LE. This is less likely a 
solution due to the state changes to QuorumPeer during the process required the 
process has to be synchronized.
* Don't hold the lock while shutdown QuorumCnxManager. This is essentially the 
latest patch did by moving the shutdown out of the sync block. However, it does 
not fix the issue I spotted in previous comments (potential dead lock in 
QuorumPeer.restartLeaderElection - because this method synchronized implicitly 
on the intrinsic lock). 
* Don't synchronize on the QuorumPeer inside QuorumCnxManager's shutdown 
process: this implies we can't involve any of the QuorumPeer's synchronized 
methods in listener threads run loop - otherwise we will end up a deadlock. As 
previously commented, removing {{synchronized(self)}} block by passing QV 
instance to {{connectOne}} would not work because we still need to invoke the 
synchronized version of getQA somewhere in listener threads.

Other thoughts - we could possibly interrupt the connectOne which is 
synchronizing on the QuorumPeer. It should be fine to interrupt because at this 
point everything is shutting down and the server socket was already closed (in 
listener.halt) so no resource leak concerns. So we just need to find a way that 
after we call {{listener.halt()}} in {{QuorumCnxManager.halt()}}, we can signal 
the listener threads to bail out the blocking state - I need to think more 
about how to do this because it seems not possible to interrupt a synchronized 
code block - so we may have to use additional abstractions (i.e. 
ReentrantLock?) which adds more complexity.

 



> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
>                 Key: ZOOKEEPER-2080
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Ted Yu
>            Assignee: Michael Han
>             Fix For: 3.5.3, 3.6.0
>
>         Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

Reply via email to