[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334720#comment-15334720
 ] 

Michael Han commented on ZOOKEEPER-2080:
----------------------------------------

I've been trying to reproduce the failure cases and analyzing the failed logs over the last 
couple of days. After mining enough data, I am fairly confident that the 
culprit responsible for the sporadic failure of this test case is 
FastLeaderElection.shutdown, which never returns in the failed cases. What 
happens looks like this:

* Server 3 joins ensemble and starts looking for a leader.
* Connections between server 3 and 2/1/0 are broken for some reason (unclear 
to me, but it happens in both failed and successful cases).
* Server 3 restarts leader election (happens in both failed and successful cases).
* The first step in restarting leader election is to shut down the old FLE, 
which is where server 3 hangs (while joining the listener thread) in failed cases. From this 
point on, server 3 is left in a bad state and never recovers (increasing the 
timeout would not help).
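
To illustrate the failure mode in the last step, here is a minimal Java sketch of a shutdown that joins a listener thread. It is not the actual FastLeaderElection or QuorumCnxManager code: the class name ListenerJoinSketch, the Listener thread, and the shutdown helper are all hypothetical, and the assumption that the listener is blocked in accept() is mine, not something the logs confirm. The sketch uses a timed join only so the demo terminates; an untimed join() on the stuck listener would block forever, which matches what the failed runs look like.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class ListenerJoinSketch {

    /** Hypothetical listener thread: blocks in accept() until its socket is closed. */
    public static final class Listener extends Thread {
        public final ServerSocket socket;

        public Listener() throws IOException {
            // Port 0 lets the OS pick any free port on the loopback interface.
            socket = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
            // Daemon thread, so a stuck listener cannot keep the JVM alive in this demo.
            setDaemon(true);
        }

        @Override
        public void run() {
            try {
                while (true) {
                    socket.accept().close();
                }
            } catch (IOException e) {
                // accept() throws once the server socket is closed, so the thread exits here.
            }
        }
    }

    /**
     * Hypothetical shutdown helper. Optionally closes the socket first, then waits
     * up to timeoutMs for the listener to exit.
     * Returns true if the listener terminated within the timeout.
     */
    public static boolean shutdown(Listener listener, boolean closeSocketFirst, long timeoutMs)
            throws IOException, InterruptedException {
        if (closeSocketFirst) {
            listener.socket.close(); // unblocks a pending accept()
        }
        listener.join(timeoutMs);
        return !listener.isAlive();
    }

    public static void main(String[] args) throws Exception {
        Listener stuck = new Listener();
        stuck.start();
        System.out.println("join without closing socket finished: "
                + shutdown(stuck, false, 500));

        Listener clean = new Listener();
        clean.start();
        System.out.println("join after closing socket finished: "
                + shutdown(clean, true, 2000));
    }
}
```

The point of the sketch is only that a join on a thread that nothing has unblocked cannot be fixed by a longer test timeout; whatever blocks the listener has to be released before the join.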

This also aligns with some of the observations previously pointed out by Alex and 
Akihiro. Fixing ZOOKEEPER-2246 might fix this as well, so I assigned that issue to 
myself. I am working on a patch now (which might require getting ZOOKEEPER-900 done 
first; we will see).

> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
>                 Key: ZOOKEEPER-2080
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Ted Yu
>            Assignee: Michael Han
>         Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
