[ https://issues.apache.org/jira/browse/ZOOKEEPER-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
mutu updated ZOOKEEPER-4844:
----------------------------
    Attachment: systemAsync1.log
                systemAsync2.log
                systemAsync3.log

> Fail-slow disk while executing writeLongToFile can cause the follower to hang
> -----------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4844
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4844
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.10.0
>            Reporter: mutu
>            Priority: Major
>         Attachments: system1.log, system2.log, system3.log, systemAsync1.log,
>                      systemAsync2.log, systemAsync3.log
>
>
> *Symptom:* If a thread is doing a file write and gets stuck in writeLongToFile,
> that thread will hang. This blocking should be handled by ZooKeeper via
> PING. However, if the QuorumPeer thread executes writeLongToFile and encounters
> a fail-slow disk, the entire follower can get stuck. The leader will abandon
> this follower, but the follower still believes that it is a follower.
> The call stack is as follows:
> {code:java}
> at org.apache.zookeeper.common.AtomicFileOutputStream.write(AtomicFileOutputStream.java:72)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
> at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
> at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
> at java.io.BufferedWriter.flush(BufferedWriter.java:254)
> at org.apache.zookeeper.common.AtomicFileWritingIdiom.<init>(AtomicFileWritingIdiom.java:72)
> at org.apache.zookeeper.common.AtomicFileWritingIdiom.<init>(AtomicFileWritingIdiom.java:54)
> at org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:2233)
> at org.apache.zookeeper.server.quorum.QuorumPeer.setAcceptedEpoch(QuorumPeer.java:2262)
> at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:510)
> at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:91)
> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1556)
> {code}
> *Root cause:* The QuorumPeer thread is blocked in writeLongToFile and cannot
> execute readPacket, so no timeout exception is raised to trigger the error
> handler. Moreover, this problem cannot be handled by adding
> "-Dlearner.asyncSending=true" (https://issues.apache.org/jira/browse/ZOOKEEPER-4074).
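
The root cause described above is that a blocking file write runs on the same thread that would otherwise reach readPacket, so no timeout can fire. One general way to bound such a write is to run it on a helper thread and wait for it with a deadline. The following is only a minimal sketch of that idea, not ZooKeeper's code and not a proposed patch: the class name BoundedWrite, the method runWithTimeout, and the simulated slow write are all hypothetical, and a real fix would also have to decide what the follower does once the deadline passes.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not part of ZooKeeper. */
public class BoundedWrite {

    /**
     * Runs a potentially blocking task on a helper thread and waits at most
     * timeoutMs for it. Returns true if the task finished in time and false
     * if the deadline expired first.
     */
    public static boolean runWithTimeout(Runnable task, long timeoutMs)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-write");
            t.setDaemon(true); // a stuck write must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Best effort only: a thread blocked in file I/O may ignore the interrupt.
            future.cancel(true);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated fail-slow disk: the "write" takes far longer than the deadline.
        boolean completed = runWithTimeout(() -> {
            try {
                Thread.sleep(5_000);
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        }, 200);
        System.out.println(completed ? "write completed" : "write timed out");
    }
}
```

With a guard like this the calling thread regains control after the deadline and can run its normal error handling, even though the underlying write may still be stuck on the helper thread.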