[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoze Wu updated ZOOKEEPER-4074:
--------------------------------
    Priority: Critical  (was: Major)

> Network issue while Learner is executing writePacket can cause the follower 
> to hang
> -----------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4074
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4074
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.6.2
>            Reporter: Haoze Wu
>            Priority: Critical
>
> We were doing systematic fault injection testing on the latest ZooKeeper 
> stable release 3.6.2 and found that an untimely network issue can cause a 
> ZooKeeper follower to hang: clients connected to this follower get stuck in 
> their requests, and the follower does not rejoin the quorum for a long time.
> Our overall experience from the fault injection testing is that ZooKeeper is 
> robust in tolerating network issues (delay or packet loss) in various 
> places. If a thread performs a socket write and hangs in this operation due 
> to a network fault, the thread gets stuck, but in general ZooKeeper still 
> handles the situation correctly. For example, in a leader, if a 
> `LearnerHandler` thread hangs in this way, the `QuorumPeer` thread (which is 
> running the `Leader#lead` method) is able to detect the stale PING state and 
> abandon the problematic `LearnerHandler`. 
> h2. Symptom
> However, in the latest ZooKeeper stable release 3.6.2, we found that if a 
> network issue happens to occur while the `writePacket` method of the 
> `Learner` class is executing, the entire follower can get stuck. In 
> particular, the whole `QuorumPeer` thread is blocked because it is calling 
> the `writePacket` method. Unlike the other situations in which ZooKeeper 
> tolerates the fault, neither the `QuorumPeer` thread nor any other thread 
> detects or handles this issue. The follower therefore hangs in the sense 
> that it cannot communicate with the leader: the leader abandons this 
> follower once it fails to reply to `PING` packets in time, while the 
> follower itself still believes it is FOLLOWING. 
> Steps to reproduce are as follows:
>  # Start a cluster with 1 leader and 2 followers.
>  # Manually create some znodes, and do some reads and writes.
>  # Inject a network fault on a follower, either using a tool like `tcconfig` 
> (see the example after this list) or using the attached Byteman scripts. 
>  # Once the follower is stuck, you may observe that new requests to this 
> follower also get stuck.
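> For illustration, a hedged `tcconfig` invocation could look like the 
> following (the interface name and delay value are assumptions, and flags 
> may vary by `tcconfig` version; the attached Byteman scripts remain the 
> authoritative reproduction):
> {code:bash}
> # Assumed setup: run on the follower host; eth0 is the interface that
> # carries traffic to the leader. Inject a long outgoing delay:
> tcset eth0 --delay 10000ms
> # Remove the injected fault afterwards:
> tcdel eth0 --all
> {code}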
> The scripts for reproduction are provided in 
> [https://gist.github.com/functioner/ad44b5e457c8cb22eac5fc861f56d0d4].
> We confirmed that the issue also occurs on the latest `master` branch.
> h2. Root Cause
> The `writePacket` method can be invoked by 3 threads: 
> `FollowerRequestProcessor`, `SyncRequestProcessor`, and `QuorumPeer`. The 
> possible stack traces are listed in the attached `stacktrace.md` in 
> [https://gist.github.com/functioner/ad44b5e457c8cb22eac5fc861f56d0d4].
> There are two key issues. First, the network I/O is performed inside a 
> synchronized block, so all three threads serialize on the same socket 
> write. Second, unlike the socket connect and read operations, which are 
> protected by timeouts in ZooKeeper, a socket write (through the 
> `OutputArchive` wrapping the socket's `OutputStream`) never throws a 
> timeout exception when it gets stuck; it blocks until the network issue is 
> resolved. Consequently, although `Learner#readPacket` is guarded by a 
> socket read timeout, the follower never initiates a rejoin: the 
> `QuorumPeer` thread is blocked in `writePacket` and never proceeds to the 
> receiving stage (`readPacket`), so no timeout exception is thrown to 
> trigger the error handling.
> {code:java}
>     void writePacket(QuorumPacket pp, boolean flush) throws IOException {
>         // All callers (FollowerRequestProcessor, SyncRequestProcessor, and
>         // QuorumPeer) serialize on the leader's output archive here.
>         synchronized (leaderOs) {
>             if (pp != null) {
>                 messageTracker.trackSent(pp.getType());
>                 // Blocking socket write: no timeout applies, so this call
>                 // can stall indefinitely under a network fault.
>                 leaderOs.writeRecord(pp, "packet");
>             }
>             if (flush) {
>                 bufferedOutput.flush();
>             }
>         }
>     }
> {code}
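> To illustrate the read/write asymmetry, here is a minimal standalone sketch 
> (not ZooKeeper code; the host, port, and timeout are assumptions): 
> `Socket#setSoTimeout` bounds blocking reads only, so a stalled write never 
> times out.
> {code:java}
> import java.io.BufferedOutputStream;
> import java.io.OutputStream;
> import java.net.Socket;
> 
> public class WriteTimeoutSketch {
>     public static void main(String[] args) throws Exception {
>         // Hypothetical leader address and port.
>         Socket socket = new Socket("leader.example.com", 2888);
>         socket.setSoTimeout(4000); // read() throws SocketTimeoutException
>                                    // after 4s of silence...
> 
>         OutputStream out = new BufferedOutputStream(socket.getOutputStream());
>         byte[] packet = new byte[64 * 1024];
>         // ...but write()/flush() have no corresponding deadline: once the
>         // kernel send buffer fills because the peer stops draining it,
>         // this call blocks indefinitely, like Learner#writePacket.
>         out.write(packet);
>         out.flush();
> 
>         // The timeout only fires here; a thread stuck in the write above
>         // never reaches this point to benefit from it.
>         int b = socket.getInputStream().read();
>         System.out.println("read: " + b);
>     }
> }
> {code}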
> h2. Fix 
> While preparing this bug report, we found that ZOOKEEPER-3575 has already 
> proposed a fix that moves the packet sending in the learner to a separate 
> thread. The reason we were still able to expose the symptom on the `master` 
> branch is that the fix is disabled by default, gated by an undocumented 
> configuration parameter, `learner.asyncSending`. We reran the fault 
> injection testing on the `master` branch with this parameter set to true 
> (`-Dlearner.asyncSending=true`) and found that the symptom is gone. 
> Specifically, even with the network issue, the packet writes to the leader 
> still get stuck, but the `QuorumPeer` thread is no longer blocked and can 
> detect the issue while receiving packets from the leader, thanks to the 
> timeout on the socket read. As a result, the follower quickly goes back to 
> the `LOOKING` and then `FOLLOWING` state, while the problematic 
> `LearnerSender` is abandoned and recreated.
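> The essence of that fix is the classic async-sender pattern. Below is a 
> hedged, self-contained sketch of the idea (conceptually similar to, but 
> not, the actual `LearnerSender`; the `Packet` type and class names are 
> placeholders): callers enqueue packets and return immediately, so only the 
> dedicated sender thread can ever block on a stalled socket write.
> {code:java}
> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.concurrent.LinkedBlockingQueue;
> 
> public class AsyncPacketSender {
>     // Placeholder standing in for QuorumPacket.
>     static final class Packet {
>         final byte[] payload;
>         Packet(byte[] payload) { this.payload = payload; }
>     }
> 
>     private final LinkedBlockingQueue<Packet> queue = new LinkedBlockingQueue<>();
>     private final OutputStream leaderOut;
> 
>     AsyncPacketSender(OutputStream leaderOut) {
>         this.leaderOut = leaderOut;
>         Thread sender = new Thread(this::drain, "AsyncPacketSender");
>         sender.setDaemon(true);
>         sender.start();
>     }
> 
>     // Called from QuorumPeer/request-processor threads; never blocks on I/O.
>     void send(Packet p) {
>         queue.add(p);
>     }
> 
>     private void drain() {
>         try {
>             while (!Thread.currentThread().isInterrupted()) {
>                 Packet p = queue.take();
>                 leaderOut.write(p.payload); // only this thread can stall here
>                 leaderOut.flush();
>             }
>         } catch (InterruptedException ie) {
>             Thread.currentThread().interrupt(); // sender abandoned on shutdown
>         } catch (IOException ioe) {
>             // Surface the failure so the peer can drop the connection,
>             // go back to LOOKING, and rejoin the quorum.
>         }
>     }
> }
> {code}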
> h2. Proposed Improvements
> It seems the fix in ZOOKEEPER-3575 was not enabled by default, perhaps 
> because it was not clear whether the issue could really occur. 
> We would like to first confirm this is an issue and hope the attached 
> reproduction scripts can be helpful. Also, our testing shows this issue 
> occurs not only in the shutdown phase, as pointed out in ZOOKEEPER-3575, 
> but also during regular request handling, which can be serious.
> In addition, we propose making the parameter `learner.asyncSending` default 
> to true so that the fix is enabled out of the box. A related improvement is 
> to document this parameter; otherwise, administrators would have to read 
> the source code and be lucky enough to stumble upon the top of 
> `Learner.java` to realize there is a parameter that fixes the behavior. One 
> way to enable it today is shown below.
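> For reference (assuming a standard `zkEnv.sh`/`zkServer.sh` deployment; the 
> exact file to edit depends on the installation):
> {code:bash}
> # Pass the flag to the server JVM, e.g. from conf/zookeeper-env.sh,
> # which zkEnv.sh sources if present:
> export SERVER_JVMFLAGS="-Dlearner.asyncSending=true"
> bin/zkServer.sh restart
> {code}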
> Lastly, the async packet sending only appears in the `master` branch (and 
> 3.7.x). There is no such fix or parameter in the latest stable release 
> (3.6.2). We are wondering whether this fix should be backported to the 
> 3.6.x branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
