[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15396091#comment-15396091 ]
Michael Han commented on ZOOKEEPER-2080:
----------------------------------------

Hi Alex, thanks for the review :)

bq. do you think that the creation of a new election object won't be interfered if the old object shutdown/GC hasn't happened yet

The new leader election object and the old leader election object do not share object state: each has its own QuorumCnxManager that manages the underlying TCP connections used for leader election. They could in theory share the same socket address (the election address), because I believe this address is statically generated from the connection string rather than dynamically generated (like the uniquePort utility we have in tests), and this address seems to be the only thing that different QuorumCnxManager instances share. So in theory we might have two QuorumCnxManager instances, one from the old election object waiting to be shut down and the other from the new election object, that both try to bind to the same address. I haven't found any issues related to this during my stress runs of the unit tests (in particular the reconfig tests), though, and I think we could address it with some retry logic with exponential backoff when binding to the socket in QuorumCnxManager.

bq. any way to test this using a unit test

I don't have any concrete ideas around this. My thinking is that we could expose some options from the related classes under test so that we can artificially inject faults, create race conditions, and control timings. For example, we could delay the shutdown of the old leader election object and see what happens. As a simple test, I removed the statement completely and 5 out of 6 ReconfigRecoveryTest tests failed, which is expected because it is not supposed to be removed entirely. So maybe instead of removing it we can add a delay and make sure everything still works.
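To make the retry idea above a bit more concrete, here is a rough sketch of binding with exponential backoff. It is only an illustration under assumptions: the class name BindWithBackoff, the method bindWithBackoff, and its parameters (maxAttempts, initialDelayMs) are hypothetical and are not existing QuorumCnxManager code. It uses only plain java.net.ServerSocket calls.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical helper for illustration; not part of ZooKeeper. */
public class BindWithBackoff {

    /**
     * Tries to bind a server socket to addr, doubling the wait between
     * attempts each time the address is still in use.
     * Rethrows the last BindException once maxAttempts is exhausted.
     */
    public static ServerSocket bindWithBackoff(InetSocketAddress addr,
                                               int maxAttempts,
                                               long initialDelayMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            ServerSocket ss = new ServerSocket(); // created unbound
            try {
                ss.bind(addr);
                return ss; // bound successfully
            } catch (BindException e) {
                ss.close(); // release the unbound socket before retrying
                if (attempt >= maxAttempts) {
                    throw e; // give up: address still in use
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // exponential backoff
            }
        }
    }
}
```

With something like this, a listener whose address is still held by an old manager that is shutting down would wait and retry rather than fail on the first BindException. The attempt limit and initial delay would need tuning against the election timeouts.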
> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
>                 Key: ZOOKEEPER-2080
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Ted Yu
>            Assignee: Michael Han
>             Fix For: 3.5.3, 3.6.0
>
>         Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)