[ https://issues.apache.org/jira/browse/ZOOKEEPER-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Haoze Wu updated ZOOKEEPER-4074:
--------------------------------
    Priority: Critical  (was: Major)

Network issue while Learner is executing writePacket can cause the follower to hang
------------------------------------------------------------------------------------

                 Key: ZOOKEEPER-4074
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4074
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.6.2
            Reporter: Haoze Wu
            Priority: Critical

We were doing systematic fault injection testing on the latest ZooKeeper stable release 3.6.2 and found that an untimely network issue can cause ZooKeeper followers to hang: clients connected to such a follower get stuck in their requests, while the follower does not rejoin the quorum for a long time.

Our overall experience from this testing is that ZooKeeper is robust at tolerating network issues (delay or packet loss) in most places. If a thread is performing a socket write and hangs in that operation because of the fault, that thread gets stuck, but ZooKeeper generally still handles the situation correctly. For example, on the leader, if a `LearnerHandler` thread hangs in this way, the `QuorumPeer` (which is running the Leader#lead method) is able to detect the stale PING state and abandon the problematic `LearnerHandler`.

h2. Symptom

However, in the latest ZooKeeper stable release 3.6.2, we found that if a network issue occurs while the `writePacket` method of the `Learner` class is executing, the entire follower can get stuck. In particular, the whole `QuorumPeer` thread is blocked because it is calling the `writePacket` method. Unlike the other situations, in which the fault is tolerated by ZooKeeper, neither the QuorumPeer itself nor any other thread detects or handles this issue. As a result, the follower hangs in the sense that it can no longer communicate with the leader, and the leader will abandon it once it fails to reply to `PING` packets in time, even though the follower still believes it is a follower.

Steps to reproduce:
# Start a cluster with 1 leader and 2 followers.
# Manually create some znodes, and do some reads and writes.
# Inject a network fault, either with a tool like `tcconfig` or with the attached Byteman scripts.
# Once stuck, observe that new requests to this follower also get stuck.

The reproduction scripts are provided in [https://gist.github.com/functioner/ad44b5e457c8cb22eac5fc861f56d0d4]. We confirmed the issue also occurs on the latest master branch.

h2. Root Cause

The `writePacket` method can be invoked by 3 threads: FollowerRequestProcessor, SyncRequestProcessor, and QuorumPeer. The possible stack traces are listed in `stacktrace.md` in [https://gist.github.com/functioner/ad44b5e457c8cb22eac5fc861f56d0d4].

There are two key issues. First, the network I/O is performed inside a synchronized block. Second, unlike the socket connect and read operations, which are protected by timeouts in ZooKeeper, the socket write (through the OutputArchive wrapping the socket OutputStream) never throws a timeout exception when it is stuck; it simply blocks until the network issue is resolved.
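To illustrate why the write path behaves differently from the read path, here is a minimal sketch (not ZooKeeper code; the address and buffer size are made up): `Socket#setSoTimeout` bounds blocking reads, but Java sockets offer no analogous timeout for blocking writes, so a write can block for as long as the peer refuses to drain data.

{code:java}
import java.io.OutputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class WriteTimeoutSketch {
    public static void main(String[] args) throws Exception {
        Socket sock = new Socket("127.0.0.1", 2888); // hypothetical leader address
        sock.setSoTimeout(4000);                     // SO_TIMEOUT bounds reads only

        try {
            sock.getInputStream().read();            // a stalled read fails after ~4s
        } catch (SocketTimeoutException e) {
            // the reader gets a chance to handle the stalled connection
        }

        OutputStream out = sock.getOutputStream();
        // If the peer stops draining data and the TCP send buffer fills up, this
        // write blocks indefinitely; no SocketTimeoutException is thrown for writes.
        out.write(new byte[64 * 1024]);
    }
}
{code}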
In ZooKeeper's case, while the `readPacket` call made by the QuorumPeer thread is protected by a socket read timeout, the follower never initiates the rejoin because the QuorumPeer is blocked in `writePacket` and never proceeds to the receiving stage (`readPacket`), so no timeout exception is thrown to trigger the error handling.

{code:java}
void writePacket(QuorumPacket pp, boolean flush) throws IOException {
    synchronized (leaderOs) {
        if (pp != null) {
            messageTracker.trackSent(pp.getType());
            leaderOs.writeRecord(pp, "packet");
        }
        if (flush) {
            bufferedOutput.flush();
        }
    }
}
{code}

h2. Fix

While preparing this bug report, we found that ZOOKEEPER-3575 already proposed a fix that moves the learner's packet sending to a separate thread. The reason we can still reproduce the symptom on the `master` branch is that the fix is disabled by default, gated by an undocumented configuration parameter, `learner.asyncSending`. We repeated the fault injection testing on the `master` branch with this parameter set to true (`-Dlearner.asyncSending=true`) and the symptom disappeared. Specifically, even with the network issue, the packet writes to the leader still get stuck, but the `QuorumPeer` thread is no longer blocked and detects the issue while receiving packets from the leader, thanks to the timeout on the socket read. As a result, the follower quickly goes back to the `LOOKING` and then `FOLLOWING` state, while the problematic `LearnerSender` is abandoned and recreated.

h2. Proposed Improvements

The fix in ZOOKEEPER-3575 was perhaps not enabled by default because it was not clear whether the issue could really occur. We would like to first confirm that this is a real issue and hope the attached reproduction scripts are helpful. Our testing also shows that the issue occurs not only in the shutdown phase, as pointed out in ZOOKEEPER-3575, but also during regular request handling, which can be serious.

In addition, we propose making `learner.asyncSending` default to true so that the fix is enabled out of the box. A related improvement is to document this parameter; otherwise, administrators have to read the source code and be lucky enough to stumble upon the top of Learner.java to realize there is a parameter that fixes the behavior.

Lastly, the async packet sending exists only on the `master` branch (and 3.7.x); there is no such fix or parameter in the latest stable release (3.6.2). Should this fix be backported to the 3.6.x branch?
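For reference, the general shape of the async-sending fix can be summarized with a minimal sketch (illustrative only, not the actual `LearnerSender` code; the class, field, and method names below are made up): the QuorumPeer and request processor threads only enqueue packets, and a dedicated sender thread performs the blocking socket write, so a stalled write no longer stalls the QuorumPeer thread.

{code:java}
import java.io.OutputStream;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch of the async-sending pattern used by the fix.
public class AsyncPacketSender {
    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
    private final Thread sender;

    public AsyncPacketSender(OutputStream leaderOut) {
        sender = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    byte[] packet = queue.take(); // wait for the next packet
                    leaderOut.write(packet);      // only this thread can block here
                    leaderOut.flush();
                }
            } catch (Exception e) {
                // On error the sender simply exits; the QuorumPeer thread keeps
                // running and notices the broken connection via its read timeout.
            }
        }, "LearnerSender-sketch");
        sender.setDaemon(true);
        sender.start();
    }

    // Called by the QuorumPeer / request processor threads; never blocks on I/O.
    public void enqueue(byte[] packet) {
        queue.add(packet);
    }
}
{code}

In the actual fix the queued items are `QuorumPacket` objects rather than raw byte arrays; the sketch only shows the threading structure that keeps the QuorumPeer thread off the blocking write.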