Haoze Wu created ZOOKEEPER-4074:
-----------------------------------

             Summary: Network issue while Learner is executing writePacket can 
cause the follower to hang
                 Key: ZOOKEEPER-4074
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4074
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.6.2
            Reporter: Haoze Wu


We were doing some systematic fault injection testing on the latest ZooKeeper 
stable release 3.6.2 and found that an untimely network issue can cause a ZooKeeper 
follower to hang: clients connected to that follower get stuck in their 
requests, while the follower does not rejoin the quorum for a long time.

 

Our overall experience from the fault injection testing is that ZooKeeper is 
robust in tolerating network issues (delay or packet loss) in various places. If 
a thread is blocked in a socket write because of a network fault, that thread 
gets stuck, but in general ZooKeeper still handles the situation correctly even 
though the thread hangs. For example, in a leader, if a `LearnerHandler` thread 
hangs in this way, the `QuorumPeer` thread (which is running the `Leader#lead` 
method) is able to confirm the stale PING state and abandon the problematic 
`LearnerHandler`.
h2. Symptom

However, in the latest ZooKeeper stable release 3.6.2, we found that if a 
network issue happens to occur while the `writePacket` method of the `Learner` 
class is executing, the entire follower can get stuck. In particular, the whole 
`QuorumPeer` thread is blocked because it is the one calling the `writePacket` 
method. Unlike the other situations in which the fault is tolerated by 
ZooKeeper, neither the QuorumPeer itself nor any other thread detects or handles 
this issue. The follower therefore hangs in the sense that it can no longer 
communicate with the leader; the leader abandons this follower once it fails to 
reply to `PING` packets in time, yet the node still believes it is a follower.

 

Steps to reproduce are as follows:

 
 # Start a cluster with 1 leader and 2 followers.
 # Manually create some znodes, and do some reads and writes.
 # Inject a network fault, either with a tool like `tcconfig` or with the attached 
Byteman scripts.
 # Once the follower is stuck, observe that new requests to this follower also get 
stuck.

 

The scripts for reproduction are provided in 
[https://gist.github.com/functioner/ad44b5e457c8cb22eac5fc861f56d0d4].

 

We confirmed that the issue also occurs on the latest `master` branch.

 
h2. Root Cause

 

The `writePacket` method can be invoked by 3 threads: FollowerRequestProcessor, 
SyncRequestProcessor, and QuorumPeer. The possible stack traces are listed in 
the attached `stacktrace.md` at 
[https://gist.github.com/functioner/ad44b5e457c8cb22eac5fc861f56d0d4].

 

There are two key issues. First, the network I/O is performed inside a 
synchronized block. Second, unlike the socket connect and read operations, 
which are protected by timeouts in ZooKeeper, the socket write (through the 
OutputArchive wrapping the socket OutputStream) never throws a timeout 
exception when it is stuck; it only returns once the network issue is resolved. 
The read path (`readPacket`, executed by the QuorumPeer thread) does have a 
socket timeout, but the follower never initiates a rejoin because the QuorumPeer 
thread is blocked in `writePacket` and never proceeds to the receiving stage 
(`readPacket`), so no timeout exception is thrown to trigger the error handling.

 

```
    void writePacket(QuorumPacket pp, boolean flush) throws IOException {
        synchronized (leaderOs) {
            if (pp != null) {
                messageTracker.trackSent(pp.getType());
                leaderOs.writeRecord(pp, "packet");
            }
            if (flush) {
                bufferedOutput.flush();
            }
        }
    }
```
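
To make the write/read asymmetry above concrete, here is a minimal standalone Java sketch (not ZooKeeper code; the host and port are placeholders). `Socket#setSoTimeout` bounds only blocking reads, so a read past the timeout throws `SocketTimeoutException`, whereas a write or flush has no corresponding timeout and simply blocks once the local send buffer and the peer's receive window fill up.

```
import java.io.BufferedOutputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Minimal sketch (not ZooKeeper code): SO_TIMEOUT protects reads, not writes.
public class SocketTimeoutAsymmetry {
    public static void main(String[] args) throws Exception {
        // Placeholder peer; in the real scenario this is the leader's address.
        Socket sock = new Socket("127.0.0.1", 2888);
        sock.setSoTimeout(10_000); // applies to read() calls only

        try {
            // A blocking read longer than 10s throws SocketTimeoutException,
            // which is what lets the receiving path (readPacket) detect a
            // broken connection and trigger error handling.
            sock.getInputStream().read();
        } catch (SocketTimeoutException e) {
            System.out.println("read timed out as expected");
        }

        // There is no equivalent timeout on the write path: once the local
        // send buffer and the peer's receive window are full (e.g. the
        // network is dropping or delaying packets), this flush blocks until
        // the fault clears -- exactly the situation writePacket runs into.
        OutputStream out = new BufferedOutputStream(sock.getOutputStream());
        out.write(new byte[64 * 1024]);
        out.flush();
    }
}
```
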
h2. Fix 

 

While preparing this bug report, we found that ZOOKEEPER-3575 has already 
proposed a fix that moves the packet sending in the learner to a separate 
thread. The reason we are still able to expose the symptom on the `master` 
branch is that the fix is disabled by default, controlled by an undocumented 
configuration parameter `learner.asyncSending`. We repeated the fault injection 
testing on the `master` branch with this parameter set to true 
(`-Dlearner.asyncSending=true`) and the symptom was gone. Specifically, even 
with the network issue, the packet writes to the leader are still stuck, but 
the `QuorumPeer` thread is no longer blocked and can detect the issue while 
receiving packets from the leader, thanks to the timeout on the socket read. As 
a result, the follower quickly goes back to the `LOOKING` and then `FOLLOWING` 
state, while the problematic `LearnerSender` is abandoned and recreated.
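
For readers unfamiliar with ZOOKEEPER-3575, the pattern it introduces roughly looks like the sketch below; the class and method names here are ours for illustration and do not match the actual `LearnerSender` code. The idea is that the `QuorumPeer` thread and the request processors only enqueue packets and return immediately, so only the dedicated sender thread can hang on a stuck socket write, while the read timeout in `readPacket` remains effective.

```
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Rough sketch of the async-sending pattern; names are illustrative only and
// do not match the real LearnerSender implementation.
class AsyncPacketSender extends Thread {

    /** Hypothetical stand-in for the Learner's blocking socket write. */
    interface BlockingWriter {
        void writePacket(Object packet, boolean flush) throws IOException;
    }

    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
    private final BlockingWriter writer;

    AsyncPacketSender(BlockingWriter writer) {
        this.writer = writer;
        setDaemon(true);
    }

    /** Called by QuorumPeer / the request processors; never touches the network. */
    void enqueue(Object packet) {
        queue.add(packet);
    }

    @Override
    public void run() {
        try {
            while (!isInterrupted()) {
                Object packet = queue.take();
                // Only this thread can hang on a stuck write; the QuorumPeer
                // thread keeps looping in readPacket, where the socket read
                // timeout can still detect the broken connection.
                writer.writePacket(packet, true);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            // In the real fix the sender is abandoned and recreated once the
            // follower goes back to LOOKING; this sketch simply stops.
        }
    }
}
```
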

 
h2. Proposed Improvements

 

It seems that the fix in ZOOKEEPER-3575 was not enabled by default, perhaps 
because it was not clear whether the issue could really occur.

 

We would like to first confirm that this is an issue, and we hope the attached 
reproducing scripts are helpful. Also, our testing shows this issue occurs 
not only in the shutdown phase, as pointed out in ZOOKEEPER-3575, but also in 
regular request handling, which can be serious.

 

In addition, we would like to propose making the parameter 
`learner.asyncSending` default to true so the fix is enabled by default. 
A related improvement is to add a description of this parameter to the 
documentation. Otherwise, administrators would have to read the source code and 
be lucky enough to stumble upon the beginning of Learner.java to realize there 
is a parameter that fixes this behavior.

 

Lastly, the async packet sending only exists on the `master` branch (and 
3.7.x); there is no such fix or parameter in the latest stable release (3.6.2). 
We are wondering whether this fix should be backported to the 3.6.x branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
