[ https://issues.apache.org/jira/browse/ZOOKEEPER-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884804#comment-17884804 ]

Brett Folkins edited comment on ZOOKEEPER-3757 at 9/25/24 10:47 PM:
--------------------------------------------------------------------

I saw this issue in production recently on 3.6.2.

Steps to reproduce:
1) set snapCount high (10 million)
2) wait for the txlog to roll over on the leader, since we need it to have room 
to grow in step 4 without rolling
3) shut off a follower
4) write a few million transactions (in my case I watched the txlog on the 
leader until it grew to 2 GB, which was ~6 million transactions; a minimal 
load-generator sketch follows this list)
5) turn the follower back on
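
For step 4, here is a minimal write-load sketch (the connect string, znode 
path, payload size, and iteration count are illustrative assumptions, not what 
I actually ran). Repeatedly rewriting one znode grows the leader's transaction 
log while leaving the snapshot small, which is exactly the mismatch needed here:
{noformat}
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class TxnLoad {
    public static void main(String[] args) throws Exception {
        // Wait for the session to reach SyncConnected before issuing requests.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Illustrative target znode; create it once if it does not exist yet.
        if (zk.exists("/txn-load", false) == null) {
            zk.create("/txn-load", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.PERSISTENT);
        }

        // Every setData call is one transaction in the leader's txn log,
        // but the data tree (and therefore the snapshot) barely grows.
        byte[] payload = new byte[256];
        for (int i = 0; i < 6_000_000; i++) {
            zk.setData("/txn-load", payload, -1);
        }
        zk.close();
    }
}
{noformat}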

Now the leader tries to sync the follower by replaying all 2 GB of 
transactions instead of sending a 20 MB snapshot of the current data tree. 
This exceeds the 10-second default timeout, fails, and restarts forever. 
Meanwhile the sync process holds a transaction log lock that prevents the 
leader from processing writes. The whole cluster stays down until something 
stops the follower's sync attempts, rolls the transaction log on the leader, 
or changes the leadership.
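
Two knobs seem relevant while the size calculation is wrong (values below are 
illustrative only, not a recommendation): the zookeeper.forceSnapshotSync 
system property mentioned in the original report forces snapshot-based sync 
outright, and the tick-based limits in zoo.cfg control how many ticks a 
follower gets to connect and sync before the attempt times out.
{noformat}
# conf/java.env (or however you pass JVM flags) - force snapshot-based sync;
# the code flags this option as intended for testing only
SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Dzookeeper.forceSnapshotSync=true"

# zoo.cfg - alternatively, give the follower more time to finish an oversized DIFF
tickTime=2000
initLimit=30
syncLimit=10
{noformat}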

log sample from the leader (this loop repeats):
{noformat}
2024-09-25 22:10:28,018 [myid:2] - WARN  
[LearnerHandler-/10.9.145.61:35398:LearnerHandler@737] - ******* GOODBYE 
/10.9.145.61:35398 ********
2024-09-25 22:10:28,238 [myid:2] - INFO  
[LearnerHandler-/10.9.145.61:35456:LearnerHandler@979] - Sending DIFF 
zxid=0x2f00871bdd  for peer sid: 1
2024-09-25 22:10:43,899 [myid:2] - INFO  
[WorkerReceiver[myid=2]:FastLeaderElection$Messenger$WorkerReceiver@389] - 
Notification: my state:LEADING; n.sid:1, n.state:LOOKING, n.leader:1, 
n.round:0x20, n.peerEpoch:0x2f, n.zxid:0x2f0028ce99, message format 
version:0x2, n.config version:0x2f0071c25a
2024-09-25 22:10:43,907 [myid:2] - INFO  
[LearnerHandler-/10.9.145.61:35562:LearnerHandler@504] - Follower sid: 1 : info 
: 10.9.145.61:2888:3888:participant;0.0.0.0:2181
2024-09-25 22:10:43,909 [myid:2] - INFO  
[LearnerHandler-/10.9.145.61:35562:ZKDatabase@345] - On disk txn sync enabled 
with snapshotSizeFactor 0.33
2024-09-25 22:10:51,945 [myid:2] - INFO  
[LearnerHandler-/10.9.145.61:35562:LearnerHandler@807] - Synchronizing with 
Learner sid: 1 maxCommittedLog=0x2f00871bde minCommittedLog=0x2f008719ea 
lastProcessedZxid=0x2f00871bde peerLastZxid=0x2f0028ce99
2024-09-25 22:10:51,945 [myid:2] - ERROR 
[LearnerHandler-/10.9.145.61:35456:LearnerHandler@714] - Unexpected exception 
causing shutdown while sock still open
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at 
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
        at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
        at 
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
        at 
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:611)
2024-09-25 22:10:51,946 [myid:2] - WARN  
[LearnerHandler-/10.9.145.61:35456:LearnerHandler@737] - ******* GOODBYE 
/10.9.145.61:35456 ********
2024-09-25 22:10:51,947 [myid:2] - INFO  
[LearnerHandler-/10.9.145.61:35562:LearnerHandler@884] - Use txnlog and 
committedLog for peer sid: 1
2024-09-25 22:10:52,854 [myid:2] - INFO  
[LearnerHandler-/10.9.145.61:35562:LearnerHandler@979] - Sending DIFF 
zxid=0x2f00871bde  for peer sid: 1
2024-09-25 22:11:03,948 [myid:2] - INFO  
[WorkerReceiver[myid=2]:FastLeaderElection$Messenger$WorkerReceiver@389] - 
Notification: my state:LEADING; n.sid:1, n.state:LOOKING, n.leader:1, 
n.round:0x20, n.peerEpoch:0x2f, n.zxid:0x2f0028ce99, message format 
version:0x2, n.config version:0x2f0071c25a
2024-09-25 22:11:03,953 [myid:2] - INFO  
[LearnerHandler-/10.9.145.61:35612:LearnerHandler@504] - Follower sid: 1 : info 
: 10.9.145.61:2888:3888:participant;0.0.0.0:2181
2024-09-25 22:11:03,954 [myid:2] - INFO  
[LearnerHandler-/10.9.145.61:35612:ZKDatabase@345] - On disk txn sync enabled 
with snapshotSizeFactor 0.33
2024-09-25 22:11:14,309 [myid:2] - INFO  
[LearnerHandler-/10.9.145.61:35612:LearnerHandler@807] - Synchronizing with 
Learner sid: 1 maxCommittedLog=0x2f00871bdf minCommittedLog=0x2f008719eb 
lastProcessedZxid=0x2f00871bdf peerLastZxid=0x2f0028ce99
2024-09-25 22:11:14,311 [myid:2] - ERROR 
[LearnerHandler-/10.9.145.61:35562:LearnerHandler@714] - Unexpected exception 
causing shutdown while sock still open{noformat}
And the logs from the follower have this on repeat:
{noformat}
2024-09-25 22:13:29,722 [myid:1] - INFO  
[QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):QuorumPeer@863]
 - Peer state changed: following - synchronization - diff
2024-09-25 22:13:49,741 [myid:1] - WARN  
[QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):Follower@129]
 - Exception when following the leader
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at 
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
        at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
        at 
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
        at 
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:182)
        at 
org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:506)
        at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:107)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1458){noformat}
Fixing the getStorageSize calculation should make the decision between txlog 
sync and snapshot sync accurate.
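
For illustration only, a sketch of the shape of fix suggested in the original 
report; this is not the actual FileTxnIterator code, and the field names just 
follow the description there:
{noformat}
// Sketch: today only this.storedFiles is summed, so a single huge "current"
// log file makes getStorageSize() report 0 and the txnlog path look cheap.
long getStorageSize() throws IOException {
    long size = 0;
    for (File f : storedFiles) {
        size += f.length();
    }
    if (logFile != null) {
        size += logFile.length(); // also count the file currently being read
    }
    return size;
}
{noformat}
As noted in the original report, counting the whole current file can 
overestimate how much data actually has to be sent, so the estimate may still 
need to discount the portion before peerLastZxid.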

I believe the impact in my example is so severe because snapCount was raised 
without setting _txnLogSizeLimitInKb_ to keep transaction logs appropriately 
sized. In fact, I believe similar problems could occur whenever transaction 
logs get oversized, even if getStorageSize were correct, because of extended 
seek times in the file while ZooKeeper searches for the first zxid it actually 
needs.
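
A sketch of the pairing I mean, with illustrative values; the log size limit 
is shown in its Java system-property form:
{noformat}
# zoo.cfg - keep snapshots infrequent
snapCount=10000000

# conf/java.env (or equivalent JVM flags) - but still roll the transaction log
# at roughly 100 MB (the value is in KB)
SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Dzookeeper.txnLogSizeLimitInKb=102400"
{noformat}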

With default settings, the impact is limited to 100k transactions (the default 
snapCount) and likely adds only a few hundred milliseconds of delay when a 
follower joins.

 


> Transaction log sync can take 20+ seconds after leader election when there is 
> a large snapCount
> -----------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3757
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3757
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.5.6
>            Reporter: Alex Kaiser
>            Priority: Minor
>
> Short overview:
> If you have a large snapCount (we are using 10,000,000) you can end up with a 
> very large transaction log (ours are between 1 GB - 1.5 GB), which can cause 
> the sync between a newly elected leader and its followers to take 20+ 
> seconds.  This stems from the code (FileTxnIterator.getStorageSize()) in most 
> cases returning 0 even if the transaction log is 1 GB.
>  
> Long Explanation:
> A few years ago we had some trouble with our zookeeper cluster having many 
> shortish (100-500 ms) pauses during our peak traffic times.  These turned out 
> to be caused by the master taking a snapshot.  To solve this we upped the 
> snapCount to 10,000,000 so that we weren't taking snapshots nearly as often.  
> We also made changes to reduce the size of our snapshots (from around 2.5 GB 
> to ~500 MB).
> I don't remember what version of zookeeper we were using originally, but this 
> was all working fine on 3.4.10; we started to have problems when we 
> upgraded to 3.5.6 around 3 months ago.  We have a fairly high transaction 
> rate and thus end up hitting the zxid overflow about once a month, which 
> causes a leader election.  When we were on 3.4.10, this was fine because 
> leader election and syncing would happen within 2-4 seconds, which was low 
> enough for us to basically ignore it.  However, after we upgraded 
> to 3.5.6 the pauses we saw took between 15-30 seconds, which was 
> unacceptable for us.
> For now to solve this I set zookeeper.forceSnapshotSync=true (yes, I know the 
> comments say this is only supposed to be used for testing), which causes 
> syncing to use snapshots (only 10-50 MB) instead of the transaction log (1-1.5 
> GB).
>  
> Technical details:
> I tried taking a look at the code and I think I know why this happens.  From 
> what I learned, it looks like when a follower needs to sync with a leader, 
> LearnerHandler.syncFollower() gets called on the leader.  It goes through a big 
> if statement, but at one point it will call 
> db.getProposalsFromTxnLog(peerLastZxid, sizeLimit).  That peerLastZxid could 
> be some very old zxid if the follower hadn't taken a snapshot in a long time 
> (i.e. has a large snapCount) and the sizeLimit will generally be 0.33 * 
> snapshot size (in my case around 10 MB).
> Inside that getProposalsFromTxnLog it will create a TxnIterator and then 
> call getStorageSize() on it.  The problem comes from the fact that this call 
> to getStorageSize() will usually return 0.  The reason is that 
> the FileTxnIterator class has a "current" log file that it is 
> reading, this.logFile, and a list of files that it would still have to 
> iterate through, this.storedFiles.  The getStorageSize() function though only 
> looks at the storedFiles list, so if the iterator has one large transaction 
> log as the "current" log file and nothing in the storedFiles list, then this 
> method will return 0 even though there is a huge transaction log to sync.
> One other side effect of this problem is that even bouncing a follower can 
> cause long (5-10 second) pauses as the leader will hold a read lock on the 
> transaction log while it syncs up with the follower.
> While I know what the problem is, I don't know what the best solution is.  I'm 
> willing to work on the solution, but I would appreciate suggestions.  One 
> possible solution would be to include this.logFile in the 
> getStorageSize() calculation; however, this could overestimate the amount of 
> data that is in the iterator (possibly by a lot) and I don't know what the 
> consequences of doing that are.  I'm not quite sure what 
> is a good way to get an accurate estimate.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
