[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740513#comment-14740513
 ] 

Akihiro Suda commented on ZOOKEEPER-2080:
-----------------------------------------

Looking at JaCoCo reports, I also noticed that 
[{{QCM.SendWorker#finish()}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L352-L360]
 (and hence {{QCM.RecvWorker#finish()}}) in {{QCM#receiveConnection()}} (the 
{{sid < self.getId()}} branch) is called only in the failed runs.

When I comment out this call, the bug becomes hard to reproduce.

So I believe that the bug is caused by *a race condition between TCP packet 
arrivals and {{SendWorker}}/{{RecvWorker}} lifecycles*.

In particular, the socket handling in 
[{{QCM.RecvWorker#run}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L893-L926]
 is *very suspicious*, as it can be neither interrupted nor timed out.
(It should use {{java.nio.channels.SocketChannel}} rather than the plain old 
{{java.net.Socket}}.)
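
As a rough illustration of the interruptibility point (this is not ZooKeeper 
code; the class and method names are hypothetical), the sketch below blocks a 
thread in {{SocketChannel#read}} on a loopback connection whose peer never 
writes, then interrupts it. A blocking-mode {{SocketChannel}} is an 
interruptible channel, so the read is expected to end with 
{{ClosedByInterruptException}}.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class, not part of ZooKeeper.
class InterruptibleReadDemo {
    /** Returns true if a read blocked on a SocketChannel was aborted by Thread.interrupt(). */
    static boolean blockedReadIsInterruptible() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // ephemeral loopback port
            AtomicReference<Throwable> outcome = new AtomicReference<>();

            Thread reader = new Thread(() -> {
                try (SocketChannel ch = SocketChannel.open(server.getLocalAddress())) {
                    ch.read(ByteBuffer.allocate(4)); // blocks: the peer never writes
                } catch (Throwable t) {
                    outcome.set(t); // remember how the read ended
                }
            });
            reader.start();

            try (SocketChannel peer = server.accept()) {
                Thread.sleep(200);  // give the reader time to block in read()
                reader.interrupt(); // closes the channel and wakes the reader
                reader.join(5000);
            }
            return outcome.get() instanceof ClosedByInterruptException;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read interrupted: " + blockedReadIsInterruptible());
    }
}
```

The interrupt closes the channel as a side effect, so a worker written this 
way would have to treat an interrupt as the end of that connection.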

Note that the bug also becomes hard to reproduce when I comment out 
[{{Socket#setTcpNoDelay(true)}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L566]
 in {{QCM#setSockOpts()}} (as I reported on Aug 14), or when I use 
{{BufferedOutputStream}} instead of {{DataOutputStream}} in 
[{{QCM.SendWorker()}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L719].
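
One possible reason the two changes above have a similar effect is that both 
reduce the number of separate small writes that reach the wire. The sketch 
below is only an illustration under assumptions: it assumes a length-prefixed 
message ({{writeInt}} followed by the payload bytes), and the counting stream 
and class names are hypothetical. It counts how many write calls reach the 
underlying stream with and without a {{BufferedOutputStream}} in between.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of ZooKeeper.
class WriteCoalescingDemo {
    /** Stand-in for a socket's output stream that only counts write calls. */
    static final class CountingStream extends OutputStream {
        int writes = 0;
        @Override public void write(int b) { writes++; }
        @Override public void write(byte[] b, int off, int len) { writes++; }
    }

    /** Sends one length-prefixed message and returns the number of underlying writes. */
    static int countWrites(boolean buffered) {
        CountingStream sink = new CountingStream();
        OutputStream target = buffered ? new BufferedOutputStream(sink) : sink;
        try (DataOutputStream dout = new DataOutputStream(target)) {
            byte[] payload = new byte[40];  // assumed small message body
            dout.writeInt(payload.length);  // length prefix
            dout.write(payload);            // message body
            dout.flush();                   // buffered case: one combined write
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.writes;
    }

    public static void main(String[] args) {
        System.out.println("unbuffered writes: " + countWrites(false));
        System.out.println("buffered writes:   " + countWrites(true));
    }
}
```

In the buffered case the prefix and the body reach the sink together on 
{{flush()}}. In the unbuffered case they arrive as at least two writes, and 
with {{TCP_NODELAY}} enabled each of those may be sent as its own segment.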

[~shralex], can I have your opinion on this?



> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
>                 Key: ZOOKEEPER-2080
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Ted Yu
>            Assignee: Raul Gutierrez Segales
>            Priority: Minor
>         Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
