[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593380#comment-15593380
 ] 

Michael Han commented on ZOOKEEPER-2080:
----------------------------------------

Getting back to this issue. I have another perspective: rather than 
synchronizing on the entire QuorumPeer, we could use fine-grained locks to 
specifically guard the set of state affiliated with QuorumVerifier, namely:
* Quorum verifier.
* Last seen quorum verifier.
* Some additional fields that pertain to the QV and last QV, such as the 
quorum address, election port, and client address.

As for what to guard, after examining the code paths and call graphs I feel 
guarding the individual fields (QV, last QV, quorum address, etc.) is enough; 
if I am wrong, we will need to identify a compound state and introduce a new 
class that encapsulates all of these fields. We should also make sure that 
every access to these fields is guarded by the same lock. 
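To make the idea concrete, here is a minimal sketch of the fine-grained 
locking approach. The class and field types are hypothetical stand-ins, not 
the actual QuorumPeer code: a single dedicated lock object guards the 
QV-related fields together, so readers and writers always observe a 
consistent pair without ever taking the QuorumPeer monitor.

{code}
// Hypothetical sketch: guard QV-related state with one dedicated lock
// instead of synchronizing on the whole QuorumPeer instance.
public class QuorumPeerSketch {
    private final Object qvLock = new Object(); // guards the two fields below
    private String quorumVerifier;              // stand-in for QuorumVerifier
    private String lastSeenQuorumVerifier;

    public void setQuorumVerifier(String qv) {
        synchronized (qvLock) {
            // Both fields change under the same lock, so no reader can
            // observe a new QV paired with a stale last-seen QV.
            lastSeenQuorumVerifier = quorumVerifier;
            quorumVerifier = qv;
        }
    }

    public String getQuorumVerifier() {
        synchronized (qvLock) {
            return quorumVerifier;
        }
    }

    public String getLastSeenQuorumVerifier() {
        synchronized (qvLock) {
            return lastSeenQuorumVerifier;
        }
    }
}
{code}

Threads that previously synchronized on the QuorumPeer for unrelated reasons 
no longer contend with QV reads and writes.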

bq. But this update would have to synchronize on both QP and QCM, so not sure 
if the same problem exists as above.
I feel the same way - this solution just moves the lock around =).

bq. Do we need to update all view listeners in a critical section?
I think yes, if we want to maintain a globally consistent view of the QV 
state. When the value of the QV in QP is updated, QP needs to fire events to 
notify subscribers of the change, and without locking, by the time the event 
reaches a subscriber the view of the world might already have changed. We 
might also need locking in the event callback of each listener, to avoid any 
potential race condition per listener. In this case, though, the likelihood of 
running into a race / deadlock sounds lower, since we are no longer 
synchronizing on a single QP.
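A rough sketch of notifying view listeners inside the critical section 
(again with hypothetical names and a plain String standing in for the QV): 
every listener sees exactly the value that was just installed, at the cost of 
holding the lock across the callbacks.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: fire listener callbacks while still holding the
// QV lock, so each listener observes the freshly installed value.
public class QvNotifier {
    private final Object qvLock = new Object();
    private String quorumVerifier;
    private final List<Consumer<String>> listeners = new ArrayList<>();

    public void addListener(Consumer<String> l) {
        synchronized (qvLock) {
            listeners.add(l);
        }
    }

    public void updateQuorumVerifier(String qv) {
        synchronized (qvLock) {
            quorumVerifier = qv;
            // Callbacks run inside the critical section: a consistent view,
            // but callbacks must not acquire other locks in an order that
            // conflicts with qvLock, or they can deadlock.
            for (Consumer<String> l : listeners) {
                l.accept(quorumVerifier);
            }
        }
    }
}
{code}

The alternative is to snapshot the QV and listener list under the lock and 
invoke the callbacks outside it, which trades the perfectly consistent view 
for shorter lock hold times.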

In any case, I feel refactoring QuorumPeer to use fine-grained locking is 
necessary. Attaching a patch to express my idea.

> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
>                 Key: ZOOKEEPER-2080
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Ted Yu
>            Assignee: Michael Han
>             Fix For: 3.5.3, 3.6.0
>
>         Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
