[ https://issues.apache.org/jira/browse/ZOOKEEPER-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906414#comment-17906414 ]
Filippo Ghibellini commented on ZOOKEEPER-3757:
-----------------------------------------------

We experienced the same problem in production and tracked it down to the exact same cause. In our case the connection is terminated with a [`Closing connection to peer due to transaction timeout.`|https://github.com/apache/zookeeper/blob/f7af2acd7f92868507e66780bf6818f47d068a77/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L1079] after ~1 min ± 20 s and not after 10 s. This appears to be because [the timeout is only enforced once the leader starts sending data to the follower|https://github.com/apache/zookeeper/blob/f7af2acd7f92868507e66780bf6818f47d068a77/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L1070]. We've also observed that in our case the follower does make progress with each synchronization attempt (the `Synchronizing with Learner sid: ... peerLastZxid=<zxid>` log line differs between iterations).

> Transaction log sync can take 20+ seconds after leader election when there is a large snapCount
> ------------------------------------------------------------------------------------------------
>
>                  Key: ZOOKEEPER-3757
>                  URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3757
>              Project: ZooKeeper
>           Issue Type: Bug
>           Components: leaderElection
>     Affects Versions: 3.5.6
>             Reporter: Alex Kaiser
>             Priority: Minor
>
> Short overview:
> If you have a large snapCount (we are using 10,000,000), you can end up with a very large transaction log (ours are between 1 GB and 1.5 GB), which can cause the sync between a newly elected leader and its followers to take 20+ seconds. This stems from the code (FileTxnIterator.getStorageSize()) in most cases returning 0 even if the transaction log is 1 GB.
>
> Long Explanation:
> A few years ago we had trouble with our ZooKeeper cluster seeing many short (100-500 ms) pauses during our peak traffic times. These turned out to be caused by the leader taking a snapshot. To solve this we raised the snapCount to 10,000,000 so that we weren't taking snapshots nearly as often. We also made changes to reduce the size of our snapshots (from around 2.5 GB to ~500 MB).
> I don't remember which version of ZooKeeper we were using originally, but all of this worked fine on 3.4.10; we started to have problems when we upgraded to 3.5.6 around 3 months ago. We have a fairly high transaction rate and thus hit the zxid overflow about once a month, which causes a leader election. On 3.4.10 this was fine because leader election and syncing would complete within 2-4 seconds, which was low enough for us to basically ignore it. After we upgraded to 3.5.6, however, the pauses took between 15 and 30 seconds, which was unacceptable for us.
> For now, to work around this, I set zookeeper.forceSnapshotSync=true (yes, I know the comments say it is only supposed to be used for testing), which causes syncing from snapshots (only 10-50 MB) instead of from the transaction log (1-1.5 GB).
>
> Technical details:
> I tried taking a look at the code and I think I know why this happens. From what I learned, when a follower needs to sync with a leader, LearnerHandler.syncFollower() is called on the leader. It goes through a big if statement, but at one point it calls db.getProposalsFromTxnLog(peerLastZxid, sizeLimit).
> That peerLastZxid can be a very old zxid if the follower hasn't taken a snapshot in a long time (i.e. has a large snapCount), and the sizeLimit will generally be 0.33 * snapshot size (in my case around 10 MB).
> Inside getProposalsFromTxnLog, a TxnIterator is created and getStorageSize() is called on it. The problem is that this call to getStorageSize() usually returns 0. The reason is that the FileTxnIterator class has a "current" log file that it is reading, this.logFile, and a list of files it still has to iterate through, this.storedFiles. getStorageSize(), however, only looks at the storedFiles list, so if the iterator has one large transaction log as its "current" log file and nothing in the storedFiles list, the method returns 0 even though there is a huge transaction log to sync.
> One other side effect of this problem is that even bouncing a follower can cause long (5-10 second) pauses, as the leader holds a read lock on the transaction log while it syncs up with the follower.
> While I know what the problem is, I don't know what the best solution is. I'm willing to work on a fix, but I would appreciate suggestions. One possible solution would be to include this.logFile in the getStorageSize() calculation; however, that could overestimate the amount of data in the iterator (possibly by a lot), and I don't know what the consequences of that would be. I'm not sure of a good way to get an accurate estimate.
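To make the quoted explanation concrete, here is a small, self-contained sketch contrasting the two estimates. It is illustrative only, with simplified names, not the actual FileTxnLog.FileTxnIterator code: the first method mirrors the reported behaviour (only the not-yet-opened files are counted), the second shows the direction the reporter suggests (also counting the file the iterator is currently reading).

```java
import java.io.File;
import java.util.List;

// Illustrative sketch, not ZooKeeper source: models an iterator that has a
// "current" log file plus the list of log files it has not opened yet, the
// structure described above as this.logFile / this.storedFiles.
class TxnLogSizeEstimateSketch {

    /** Mirrors the reported behaviour: only the not-yet-opened files are counted. */
    static long storedFilesOnly(File currentLogFile, List<File> storedFiles) {
        long sum = 0;
        for (File f : storedFiles) {
            sum += f.length();
        }
        // Returns 0 when the only remaining data lives in currentLogFile,
        // even if that file is 1 GB or more.
        return sum;
    }

    /** Reporter's suggested direction: also count the file currently being read. */
    static long includingCurrentLog(File currentLogFile, List<File> storedFiles) {
        long sum = storedFilesOnly(currentLogFile, storedFiles);
        if (currentLogFile != null) {
            // May overestimate: the portion of this file already behind the
            // iterator is counted as well.
            sum += currentLogFile.length();
        }
        return sum;
    }
}
```

The trade-off is the one the reporter raises: the first variant underestimates (down to 0 when everything left to send sits in the current log), while the second can overestimate by whatever portion of the current log the iterator has already consumed, and either error presumably feeds straight into the sizeLimit comparison inside getProposalsFromTxnLog.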