[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334720#comment-15334720 ]
Michael Han commented on ZOOKEEPER-2080:
----------------------------------------

I've been trying to reproduce the failure cases and analyzing the failed logs over the last couple of days. After mining enough data, I am fairly confident that the culprit behind the sporadic failure of this test case is FastLeaderElection.shutdown, which never returns in the failed cases.

What happened looks like this:
* Server 3 joins the ensemble and starts looking for a leader.
* Connections between server 3 and servers 2/1/0 were broken for some reason (unclear to me, but it happens in both failed and succeeded cases).
* Server 3 restarts leader election (this also happens in both failed and succeeded cases).
* The first step in restarting leader election is to shut down the old FLE, and this is where server 3 gets stuck (while joining the listener thread) in the failed cases.

From this point on, server 3 is left in a bad state and never recovers (increasing the timeout would not help). This also aligns with some of the observations previously pointed out by Alex and Akihiro.

Fixing ZOOKEEPER-2246 might fix this as well, so I assigned that issue to myself. Working on a patch now (which might or might not require getting ZOOKEEPER-900 done first; we will see).
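To make the failure mode above concrete (a shutdown path that joins a listener thread which never exits), here is a minimal, self-contained sketch. It is not taken from the ZooKeeper source and is not the proposed patch: the class name, the helper methods, and the use of a CountDownLatch to simulate a stuck listener are all assumptions made for illustration. It only shows why an unbounded join can block forever, and how a bounded join plus an interrupt behaves in the same situation.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only; none of these names exist in ZooKeeper.
class ShutdownJoinSketch {

    // Starts a thread that blocks until interrupted, standing in for a
    // listener thread that never notices it has been asked to stop.
    static Thread startStuckListener(CountDownLatch neverReleased) {
        Thread t = new Thread(() -> {
            try {
                neverReleased.await(); // blocks: nobody counts the latch down
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore flag and exit
            }
        }, "stuck-listener");
        t.setDaemon(true); // do not keep the JVM alive in this sketch
        t.start();
        return t;
    }

    // Waits at most timeoutMs for the thread to terminate and reports
    // whether it did. timeoutMs must be > 0, because join(0) waits forever.
    static boolean joinWithTimeout(Thread t, long timeoutMs)
            throws InterruptedException {
        t.join(timeoutMs);
        return !t.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread listener = startStuckListener(new CountDownLatch(1));

        // A plain listener.join() here would never return, which is the
        // shape of the hang described in the comment above.
        boolean stopped = joinWithTimeout(listener, 200);
        System.out.println("stopped within timeout: " + stopped);

        // Interrupting the blocked thread lets it exit, so the join succeeds.
        listener.interrupt();
        System.out.println("stopped after interrupt: "
                + joinWithTimeout(listener, 2000));
    }
}
```

Whether a bounded join or an interrupt is appropriate for the real listener thread depends on what that thread is actually blocked on, which the logs alone do not settle.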

> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
>                 Key: ZOOKEEPER-2080
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Ted Yu
>            Assignee: Michael Han
>         Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)