GitHub user afine opened a pull request:
https://github.com/apache/zookeeper/pull/251
ZOOKEEPER-2781 Flaky test: testClientAuthAgainstNoAuthServerWithLowerSid
The flaky test appears to be caused by a race condition in QuorumCnxManager
that can prevent two servers from connecting to each other. I was able to
reproduce the issue with a debugger and a little bit of patience. It would be
great if someone could share a less contrived way to reproduce it. Here is the
order of execution required to reproduce the issue between two peers (line
numbers refer to the code before this patch). As a point of clarification,
"reaching" a line means hitting but not yet executing that line (equivalent to
setting a breakpoint on that line).
1. peer1 enters `startConnection` and reaches QuorumCnxManager.java:365
2. peer0's Listener enters `handleConnection` and reaches
QuorumCnxManager.java:506
3. peer0 enters `startConnection` and reaches QuorumCnxManager.java:353
4. peer1's Listener enters `handleConnection` and reaches
QuorumCnxManager.java:483
5. peer1 executes QuorumCnxManager.java:365 and reaches
QuorumCnxManager.java:374
6. peer0's Listener executes QuorumCnxManager.java:506 and starts a
RecvWorker, which stops at QuorumCnxManager.java:1027. The Listener reaches
QuorumCnxManager.java:516.
7. peer1's Listener continues executing from QuorumCnxManager.java:483,
which removes the SendWorker and RecvWorker for its connection to peer0, and
reaches QuorumCnxManager.java:493
8. peer0's RecvWorker executes QuorumCnxManager.java:1027. The socket has
since been closed on peer1, so an exception is thrown:
```
[junit] 2017-05-07 14:48:11,055 [myid:] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@1042] - Connection broken for id 1, my id = 0, error =
[junit] java.io.EOFException
[junit]     at java.io.DataInputStream.readInt(DataInputStream.java:392)
[junit]     at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1027)
```
9. peer0 executes QuorumCnxManager.java:353 and reaches
QuorumCnxManager.java:377
10. peer1 executes connectOne at QuorumCnxManager.java:493 and continues
until it reaches QuorumCnxManager.java:290, which completes the thread
11. peer0's Listener continues executing from QuorumCnxManager.java:516.
It calls `handleConnection` again but returns at QuorumCnxManager.java:469
since peer1 never wrote its sid to the socket.
12. peer0 continues executing from QuorumCnxManager.java:377
13. peer1 continues executing from QuorumCnxManager.java:374
Both threads finish and no quorum has been formed, so
`testClientAuthAgainstNoAuthServerWithLowerSid` times out.
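For reviewers who want to see the `EOFException` from the log above in isolation, here is a minimal sketch that does not use any ZooKeeper code. It assumes the failure mode is simply "one side closes the socket before writing anything while the other side is blocked in `DataInputStream.readInt()`". The class name `EofOnClosedPeer` and the method `readIntSeesEof` are hypothetical and exist only for this illustration; the sketch does not reproduce the full interleaving in QuorumCnxManager.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical stand-alone illustration; not part of ZooKeeper or this patch.
class EofOnClosedPeer {

    /**
     * Returns true if readInt() on the accepting side fails with EOFException
     * after the connecting side closes its socket without writing any bytes.
     */
    static boolean readIntSeesEof() throws IOException {
        // Bind to an ephemeral port on the loopback interface.
        try (ServerSocket listener =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            // Plays the role of the peer that opens a connection...
            Socket initiator =
                new Socket(listener.getInetAddress(), listener.getLocalPort());
            try (Socket accepted = listener.accept()) {
                // ...and then closes it before sending anything (e.g. its sid).
                initiator.close();

                // Plays the role of the receiving worker reading a 4-byte length.
                DataInputStream din =
                    new DataInputStream(accepted.getInputStream());
                try {
                    din.readInt();
                    return false; // data arrived; not the case being illustrated
                } catch (EOFException expected) {
                    return true;  // stream ended before four bytes were read
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("EOFException observed: " + readIntSeesEof());
    }
}
```

The sketch only shows why a reader blocked in `readInt()` sees `EOFException` once the remote end has gone away. Whether both peers end up without a connection still depends on the thread ordering described in the steps above.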
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/afine/zookeeper ZOOKEEPER-2781
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/zookeeper/pull/251.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #251
----
commit 5f231641a9496aaf84a0b58d87f5fc365fa9b7e6
Author: Abraham Fine <[email protected]>
Date: 2017-05-12T20:23:55Z
ZOOKEEPER-2781: Flaky test: testClientAuthAgainstNoAuthServerWithLowerSid
----