[
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696437#comment-14696437
]
Akihiro Suda commented on ZOOKEEPER-2080:
-----------------------------------------
The bug can almost always be reproduced by injecting an 80 ms delay into every
FLE packet with my tool:
https://github.com/osrg/earthquake/tree/9078c5b039762f6c201ee036ac3453caf6168055/example/zk-repro-2080.nfqhook
When I comment out
[{{Socket#setTcpNoDelay(true)}}|https://github.com/apache/zookeeper/blob/5b1b668d33ccf7d93c31db2a53728177393fea90/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L566]
in {{QCnxM#setSockOpts()}}, the bug becomes hard to reproduce.
So I guess the bug is caused by a race condition in {{QCnxM}} (or in {{FLE}}).
Can anyone give us some advice about suspicious points in {{QCnxM}}?
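For readers unfamiliar with the option mentioned above, here is a minimal sketch of what {{Socket#setTcpNoDelay(true)}} toggles. It is not the ZooKeeper code: the class {{TcpNoDelaySketch}} and the helper {{configure}} are hypothetical names used only for illustration. With {{TCP_NODELAY}} enabled, Nagle's algorithm is disabled and small writes are sent without being coalesced, which presumably changes packet timing enough to affect how often the race shows up.

```java
import java.io.IOException;
import java.net.Socket;

public class TcpNoDelaySketch {
    // Hypothetical helper (not part of ZooKeeper): sets TCP_NODELAY on the
    // given socket and returns the value the socket reports afterwards.
    static boolean configure(Socket sock, boolean noDelay) throws IOException {
        // true  -> Nagle's algorithm disabled, small writes go out immediately
        // false -> Nagle's algorithm enabled, small writes may be coalesced
        sock.setTcpNoDelay(noDelay);
        return sock.getTcpNoDelay();
    }

    public static void main(String[] args) throws IOException {
        // Unconnected sockets are enough to show the option being applied.
        try (Socket withNoDelay = new Socket(); Socket withNagle = new Socket()) {
            System.out.println("TCP_NODELAY requested on:  " + configure(withNoDelay, true));
            System.out.println("TCP_NODELAY requested off: " + configure(withNagle, false));
        }
    }
}
```

Commenting out the {{setTcpNoDelay(true)}} call corresponds to leaving the socket at its default, which on typical platforms means Nagle's algorithm stays enabled.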
ZOOKEEPER-2246 might be related to 2080, but just applying [this
fix|https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14694804&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14694804]
proposed in 2246 does not resolve 2080.
> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Raul Gutierrez Segales
> Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
> FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
> at org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
> at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}