[ https://issues.apache.org/jira/browse/ZOOKEEPER-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885207#comment-17885207 ]
Kezhu Wang commented on ZOOKEEPER-3757:
---------------------------------------

I think we could solve this in steps:
1. Make {{getStorageSize}} take into account the remaining bytes in the current log file.
2. Calculate a sensible default for {{txnLogSizeLimitInKb}} to take advantage of it.

I guess we could also introduce heuristic snapshotting in the future, based on criteria such as the size of the last snapshot, the txn log size since the last snapshot, etc.

{quote}
I believe the impact in my example is so great because snapCount is raised without setting txnLogSizeLimitInKb to keep transaction logs appropriately sized. In fact I believe similar problems could occur when transaction logs get oversized even if getStorageSize was correct, due to extended seek times in the file while zk searches for the first zxid it really needs.
{quote}

This is where {{txnLogSizeLimitInKb}} should kick in to avoid large txn log files. We could also consider a zxid offset index to enable fast seeks, but I suspect that would be overkill, since "seek" is rare in ZooKeeper. Keeping the txn log small is most likely enough for us; it has negligible negative impact from my perspective.


> Transaction log sync can take 20+ seconds after leader election when there is a large snapCount
> -------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-3757
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3757
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.5.6
> Reporter: Alex Kaiser
> Priority: Minor
>
> Short overview:
> If you have a large snapCount (we are using 10,000,000) you can end up with a very large transaction log (ours are between 1 GB and 1.5 GB), which can cause the sync between a newly elected leader and its followers to take 20+ seconds. This stems from the code (FileTxnIterator.getStorageSize()) in most cases returning 0 even if the transaction log is 1 GB.
>
> Long Explanation:
> A few years ago we had some trouble with our zookeeper cluster having many shortish (100-500 ms) pauses during our peak traffic times. These ended up resulting from the master taking a snapshot. To solve this we upped the snapCount to 10,000,000 so that we weren't taking snapshots nearly as often. We also made changes to reduce the size of our snapshots (from around 2.5 GB to ~500 MB).
> I don't remember what version of zookeeper we were using originally, but this was all working fine on 3.4.10; we started to have problems when we upgraded to 3.5.6 around 3 months ago. We have a fairly high transaction rate and thus end up hitting the zxid overflow about once a month, which will cause a leader election. When we were on 3.4.10 this was fine, because leader election and syncing would happen within 2-4 seconds, which was low enough for us to basically ignore it. However, after we upgraded to 3.5.6 the pauses we saw took between 15 and 30 seconds, which was unacceptable for us.
> For now, to work around this I set zookeeper.forceSnapshotSync=true (yes, I know the comments say this is only supposed to be used for testing), which causes syncing using snapshots (only 10-50 MB) instead of the transaction log (1-1.5 GB).
>
> Technical details:
> I tried taking a look at the code and I think I know why this happens. From what I learned, it looks like when a follower needs to sync with a leader, LearnerHandler.syncFollower() gets called on the leader.
> It goes through a big if statement, but at one point it will call db.getProposalsFromTxnLog(peerLastZxid, sizeLimit). That peerLastZxid could be some very old zxid if the follower hadn't taken a snapshot in a long time (i.e. has a large snapCount), and the sizeLimit will generally be 0.33 * snapshot size (in my case around 10 MB).
> Inside of that getProposalsFromTxnLog it will create a TxnIterator and then call getStorageSize() on it. The problem comes from the fact that this call to getStorageSize() will usually return 0. The reason that happens is that the FileTxnIterator class has a "current" log file that it is reading, this.logFile, and a list of files that it would still have to iterate through, this.storedFiles. The getStorageSize() function, though, only looks at the storedFiles list, so if the iterator has one large transaction log as the "current" log file and nothing in the storedFiles list, then this method will return 0 even though there is a huge transaction log to sync.
> One other side effect of this problem is that even bouncing a follower can cause long (5-10 second) pauses, as the leader will hold a read lock on the transaction log while it syncs up with the follower.
> While I know what the problem is, I don't know what the best solution is. I'm willing to work on it, but I would appreciate suggestions. One possible solution would be to include this.logFile in the getStorageSize() calculation; however, this could cause the estimate to overestimate the amount of data in the iterator (possibly by a lot), and I don't know what the consequences of that are. I'm not quite sure of a good way to get an accurate estimate.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
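
To make the first step in the comment above concrete: a rough sketch of how {{FileTxnIterator.getStorageSize()}} could account for the current log file as well. The wrapper class, field layout, and use of the full file length are illustrative assumptions based on the description in this issue, not the actual ZooKeeper source or patch.

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.List;

// Standalone sketch, not the actual ZooKeeper code. Field names follow the
// description in the report above: logFile is the file the iterator is currently
// reading, storedFiles are the files it still has to iterate through.
class StorageSizeSketch {

    private File logFile;
    private List<File> storedFiles;

    public long getStorageSize() throws IOException {
        long sum = 0;
        for (File f : storedFiles) {
            sum += f.length();
        }
        if (logFile != null) {
            // Count the current file as well; as described above, a single large
            // "current" log with an empty storedFiles list otherwise yields 0.
            // Using the whole file length over-estimates (bytes before peerLastZxid
            // are included); if the iterator tracks how far it has read, the
            // "remaining bytes" suggested in the comment above could be used instead.
            sum += logFile.length();
        }
        return sum;
    }
}
{code}

Over-estimating here only makes the sizeLimit check in getProposalsFromTxnLog more likely to fall back to sending a snapshot, which is the safer direction for the 1+ GB logs described in this issue.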
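
On the configuration side, a hedged sketch of how the two knobs discussed in this thread might be combined, assuming server JVM flags are supplied through {{conf/java.env}} and that {{zookeeper.txnLogSizeLimitInKb}} is available in the version being run; the 512 MB figure is an arbitrary illustration, not a recommendation.

{noformat}
# conf/java.env sketch (hypothetical values): keep snapshots rare via a large
# snapCount, but cap transaction log growth between snapshots so a single log
# cannot reach the 1+ GB sizes described in this issue.
SERVER_JVMFLAGS="-Dzookeeper.snapCount=10000000 -Dzookeeper.txnLogSizeLimitInKb=524288"
{noformat}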