[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740513#comment-14740513 ]
Akihiro Suda commented on ZOOKEEPER-2080:
-----------------------------------------

Looking at the JaCoCo reports, I also noticed that [{{QCM.SendWorker#finish()}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L352-L360] (and hence {{QCM.RecvWorker#finish()}}) in {{QCM#receiveConnection()}} (the {{sid < self.getId()}} branch) is called only in failed experiments. When I comment out this call, the bug becomes hard to reproduce.

So I believe that the bug is caused by *a race condition between TCP packet arrivals and {{SendWorker}}/{{RecvWorker}} lifecycles*.

In particular, the socket handling in [{{QCM.RecvWorker#run}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L893-L926] is *very suspicious*, as it can be neither interrupted nor timed out. (It should use {{java.nio.channels.SocketChannel}} rather than plain old {{java.net.Socket}}.)

Note that the bug also becomes hard to reproduce when I comment out [{{Socket#setTcpNoDelay(true)}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L566] in {{QCM#setSockOpts()}} (as I reported on Aug 14), or when I use {{BufferedOutputStream}} instead of {{DataOutputStream}} in [{{QCM.SendWorker()}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L719].

[~shralex], can I have your opinion on this?
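To illustrate the remark about {{java.nio.channels.SocketChannel}}, here is a minimal standalone sketch. It is not ZooKeeper code: the class {{InterruptibleReadSketch}} and its method are hypothetical names made up for this example. It only shows that a thread blocked in {{SocketChannel#read}} is woken up by {{Thread#interrupt()}}, which closes the channel and raises {{ClosedByInterruptException}}.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; none of these names exist in ZooKeeper.
final class InterruptibleReadSketch {

    /**
     * Blocks a reader thread in SocketChannel#read, interrupts it, and
     * returns the simple name of the exception that ended the read
     * ("none" if the read was never aborted).
     */
    static String interruptBlockedChannelRead() throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             SocketChannel channel = SocketChannel.open(
                     new InetSocketAddress(loopback, server.getLocalPort()));
             Socket peer = server.accept()) {

            AtomicReference<String> outcome = new AtomicReference<>("none");
            CountDownLatch started = new CountDownLatch(1);

            Thread reader = new Thread(() -> {
                started.countDown();
                try {
                    // Blocks indefinitely: the peer never writes anything.
                    channel.read(ByteBuffer.allocate(16));
                } catch (IOException e) {
                    outcome.set(e.getClass().getSimpleName());
                }
            });
            reader.start();
            started.await();
            Thread.sleep(200);   // let the reader reach the blocking read
            reader.interrupt();  // closes the channel and aborts the read
            reader.join(5000);
            return outcome.get();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(interruptBlockedChannelRead());
    }
}
```

With a plain {{java.net.Socket}} input stream, the usual ways to unblock a reader are closing the socket from another thread or setting a read timeout with {{Socket#setSoTimeout}}. Whether either approach fits {{QCM.RecvWorker}} is a separate question that this sketch does not answer.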
> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Raul Gutierrez Segales
> Priority: Minor
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)