[ https://issues.apache.org/jira/browse/ZOOKEEPER-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906414#comment-17906414 ]
Filippo Ghibellini commented on ZOOKEEPER-3757:
-----------------------------------------------

We experienced the same problem in production and tracked it down to the exact same cause. In our case the connection is terminated with a [`Closing connection to peer due to transaction timeout.`|https://github.com/apache/zookeeper/blob/f7af2acd7f92868507e66780bf6818f47d068a77/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L1079] after ~1 min ± 20 s and not after 10 s. This appears to be because [the timeout is only enforced once the leader starts sending data to the follower|https://github.com/apache/zookeeper/blob/f7af2acd7f92868507e66780bf6818f47d068a77/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L1070]. We've also observed that in our case the follower does make progress with each synchronization attempt (the `Synchronizing with Learner sid: ... peerLastZxid=<zxid>` log line differs between iterations).

> Transaction log sync can take 20+ seconds after leader election when there is a large snapCount
> ------------------------------------------------------------------------------------------------
>
>                  Key: ZOOKEEPER-3757
>                  URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3757
>              Project: ZooKeeper
>           Issue Type: Bug
>           Components: leaderElection
>     Affects Versions: 3.5.6
>             Reporter: Alex Kaiser
>             Priority: Minor
>
> Short overview:
> If you have a large snapCount (we are using 10,000,000), you can end up with a very large transaction log (ours are between 1 GB and 1.5 GB), which can cause the sync between a newly elected leader and its followers to take 20+ seconds. This stems from the code (FileTxnIterator.getStorageSize()) in most cases returning 0 even if the transaction log is 1 GB.
>
> Long Explanation:
> A few years ago we had trouble with our ZooKeeper cluster seeing many short (100-500 ms) pauses during our peak traffic times. These turned out to be caused by the leader taking a snapshot. To solve this we raised the snapCount to 10,000,000 so that we weren't taking snapshots nearly as often. We also made changes to reduce the size of our snapshots (from around 2.5 GB to ~500 MB).
> I don't remember which version of ZooKeeper we were using originally, but all of this worked fine on 3.4.10; we started to have problems when we upgraded to 3.5.6 around 3 months ago. We have a fairly high transaction rate and thus hit the zxid overflow about once a month, which causes a leader election. On 3.4.10 this was fine because leader election and syncing would complete within 2-4 seconds, which was low enough for us to basically ignore it. After we upgraded to 3.5.6, however, the pauses took between 15 and 30 seconds, which was unacceptable for us.
> For now, to work around this, I set zookeeper.forceSnapshotSync=true (yes, I know the comments say it is only supposed to be used for testing), which causes syncing from snapshots (only 10-50 MB) instead of from the transaction log (1-1.5 GB).
>
> Technical details:
> I tried taking a look at the code and I think I know why this happens. From what I learned, when a follower needs to sync with a leader, LearnerHandler.syncFollower() is called on the leader. It goes through a big if statement, but at one point it calls db.getProposalsFromTxnLog(peerLastZxid, sizeLimit).
> That peerLastZxid can be a very old zxid if the follower hasn't taken a snapshot in a long time (i.e. has a large snapCount), and the sizeLimit will generally be 0.33 * snapshot size (in my case around 10 MB).
> Inside getProposalsFromTxnLog, a TxnIterator is created and getStorageSize() is called on it. The problem is that this call to getStorageSize() usually returns 0. The reason is that the FileTxnIterator class has a "current" log file that it is reading, this.logFile, and a list of files it still has to iterate through, this.storedFiles. getStorageSize(), however, only looks at the storedFiles list, so if the iterator has one large transaction log as its "current" log file and nothing in the storedFiles list, the method returns 0 even though there is a huge transaction log to sync.
> One other side effect of this problem is that even bouncing a follower can cause long (5-10 second) pauses, as the leader holds a read lock on the transaction log while it syncs up with the follower.
> While I know what the problem is, I don't know what the best solution is. I'm willing to work on a fix, but I would appreciate suggestions. One possible solution would be to include this.logFile in the getStorageSize() calculation; however, that could overestimate the amount of data in the iterator (possibly by a lot), and I don't know what the consequences of that would be. I'm not sure of a good way to get an accurate estimate.
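To make the quoted explanation concrete, here is a small, self-contained sketch contrasting the two estimates. It is illustrative only, with simplified names, not the actual FileTxnLog.FileTxnIterator code: the first method mirrors the reported behaviour (only the not-yet-opened files are counted), the second shows the direction the reporter suggests (also counting the file the iterator is currently reading).

```java
import java.io.File;
import java.util.List;

// Illustrative sketch, not ZooKeeper source: models an iterator that has a
// "current" log file plus the list of log files it has not opened yet, the
// structure described above as this.logFile / this.storedFiles.
class TxnLogSizeEstimateSketch {

    /** Mirrors the reported behaviour: only the not-yet-opened files are counted. */
    static long storedFilesOnly(File currentLogFile, List<File> storedFiles) {
        long sum = 0;
        for (File f : storedFiles) {
            sum += f.length();
        }
        // Returns 0 when the only remaining data lives in currentLogFile,
        // even if that file is 1 GB or more.
        return sum;
    }

    /** Reporter's suggested direction: also count the file currently being read. */
    static long includingCurrentLog(File currentLogFile, List<File> storedFiles) {
        long sum = storedFilesOnly(currentLogFile, storedFiles);
        if (currentLogFile != null) {
            // May overestimate: the portion of this file already behind the
            // iterator is counted as well.
            sum += currentLogFile.length();
        }
        return sum;
    }
}
```

The trade-off is the one the reporter raises: the first variant underestimates (down to 0 when everything left to send sits in the current log), while the second can overestimate by whatever portion of the current log the iterator has already consumed, and either error presumably feeds straight into the sizeLimit comparison inside getProposalsFromTxnLog.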