GitHub user afine opened a pull request:

    https://github.com/apache/zookeeper/pull/251

    ZOOKEEPER-2781 Flaky test: testClientAuthAgainstNoAuthServerWithLowerSid

    The flaky test appears to be caused by a race condition in QuorumCnxManager 
that can prevent two servers from connecting to each other. I was able to 
reproduce the issue with a debugger and a little bit of patience; it would be 
great if someone could share a less contrived way to reproduce it. Below is the 
order of execution required to trigger the issue between two peers (line 
numbers refer to the code before this patch). To clarify, "reaching" a line 
means hitting but not yet executing that line (equivalent to stopping at a 
breakpoint set on that line).
    
    1. peer1 enters `startConnection` and reaches QuorumCnxManager.java:365
    2. peer0's Listener enters `handleConnection` and reaches 
QuorumCnxManager.java:506
    3. peer0 enters `startConnection` and reaches QuorumCnxManager.java:353
    4. peer1's Listener enters `handleConnection` and reaches 
QuorumCnxManager.java:483
    5. peer1 executes QuorumCnxManager.java:365 and reaches 
QuorumCnxManager.java:374
    6. peer0's Listener executes QuorumCnxManager.java:506 and starts a 
RecvWorker, which stops at QuorumCnxManager.java:1027. The Listener reaches 
QuorumCnxManager.java:516.
    7. peer1's Listener continues executing from QuorumCnxManager.java:483, 
which removes the SendWorker and RecvWorker for its connection to peer0, and 
reaches QuorumCnxManager.java:493
    8. peer0's RecvWorker executes QuorumCnxManager.java:1027. Because peer1 
has since closed the socket, an exception is thrown:
    ```
        [junit] 2017-05-07 14:48:11,055 [myid:] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1042] - Connection broken for id 1, my id = 0, error =
        [junit] java.io.EOFException
        [junit]     at java.io.DataInputStream.readInt(DataInputStream.java:392)
        [junit]     at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1027)
    ```
    9. peer0 executes QuorumCnxManager.java:353 and reaches 
QuorumCnxManager.java:377
    10. peer1 executes `connectOne` at QuorumCnxManager.java:493 and continues 
until it reaches QuorumCnxManager.java:290, which completes the thread
    11. peer0's Listener continues executing from QuorumCnxManager.java:516. It 
calls `handleConnection` again but returns at QuorumCnxManager.java:469, since 
peer1 never wrote its sid to the socket.
    12. peer0 continues executing from QuorumCnxManager.java:377
    13. peer1 continues executing from QuorumCnxManager.java:374
    
    
    Both threads finish without a quorum having been formed, so 
`testClientAuthAgainstNoAuthServerWithLowerSid` times out.
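
    For readers who do not have the code open, the steps above rely on the 
rule that only one of the two sockets between a pair of peers is kept, decided 
by server id (sid). The following is a minimal sketch of that rule, assuming 
the usual sid-based tie-break in QuorumCnxManager. It is a simplified model 
and not the actual implementation: the class `TieBreakSketch`, the `Action` 
enum, and the methods `onOutgoing`, `onIncoming`, and `survives` are 
hypothetical names introduced only for illustration.

```java
// Simplified model of the sid-based tie-break. Hypothetical names throughout;
// this is NOT the actual QuorumCnxManager code.
final class TieBreakSketch {

    /** What a peer does with a freshly established socket. */
    enum Action { KEEP, DROP, DROP_AND_RECONNECT }

    /** Decision on the side that initiated the connection (cf. startConnection). */
    static Action onOutgoing(long myId, long remoteId) {
        // The peer with the smaller sid gives up its own outgoing connection.
        return remoteId > myId ? Action.DROP : Action.KEEP;
    }

    /** Decision on the side that accepted the connection (cf. handleConnection). */
    static Action onIncoming(long myId, long remoteId) {
        // The peer with the larger sid rejects the incoming socket and dials back.
        return remoteId < myId ? Action.DROP_AND_RECONNECT : Action.KEEP;
    }

    /** True when both ends agree to keep a connection initiated by `from` to `to`. */
    static boolean survives(long from, long to) {
        return onOutgoing(from, to) == Action.KEEP
            && onIncoming(to, from) == Action.KEEP;
    }

    public static void main(String[] args) {
        System.out.println("peer0 -> peer1 survives: " + survives(0, 1));
        System.out.println("peer1 -> peer0 survives: " + survives(1, 0));
    }
}
```

    In this model only the connection initiated by the peer with the larger 
sid survives, so `main` reports `false` for peer0 -> peer1 and `true` for 
peer1 -> peer0. The race described above arises because each peer applies its 
half of the rule on a different thread, so a socket that one side has just 
decided to keep can already have been closed by the other side.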
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/afine/zookeeper ZOOKEEPER-2781

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #251
    
----
commit 5f231641a9496aaf84a0b58d87f5fc365fa9b7e6
Author: Abraham Fine <[email protected]>
Date:   2017-05-12T20:23:55Z

    ZOOKEEPER-2781: Flaky test: testClientAuthAgainstNoAuthServerWithLowerSid

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
