Alex Kaiser created ZOOKEEPER-3757:
--------------------------------------

             Summary: Transaction log sync can take 20+ seconds after leader election when there is a large snapCount
                 Key: ZOOKEEPER-3757
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3757
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection
    Affects Versions: 3.5.6
            Reporter: Alex Kaiser


Short overview:

If you have a large snapCount (we are using 10,000,000) you can end up with a 
very large transaction log (ours are between 1 GB and 1.5 GB), which can cause 
the sync between a newly elected leader and its followers to take 20+ seconds.  
This stems from FileTxnIterator.getStorageSize() returning 0 in most cases, 
even when the transaction log is 1 GB.

 

Long Explanation:

A few years ago we had some trouble with our ZooKeeper cluster having many 
shortish (100-500 ms) pauses during our peak traffic times.  These turned out 
to be caused by the leader taking a snapshot.  To solve this we raised the 
snapCount to 10,000,000 so that we weren't taking snapshots nearly as often.  
We also made changes to reduce the size of our snapshots (from around 2.5 GB to 
~500 MB).

I don't remember what version of ZooKeeper we were using originally, but this 
was all working fine on 3.4.10; the problems started when we upgraded to 3.5.6 
around 3 months ago.  We have a fairly high transaction rate and thus hit the 
zxid overflow about once a month, which forces a leader election.  When we were 
on 3.4.10 this was fine, because leader election and syncing would happen 
within 2-4 seconds, which was low enough for us to basically ignore it.  
However, after we upgraded to 3.5.6 the pauses took between 15 and 30 seconds, 
which is unacceptable for us.

For now, to work around this, I set zookeeper.forceSnapshotSync=true (yes, I 
know the comments say it is only supposed to be used for testing), which makes 
the leader sync followers from a snapshot (only 10-50 MB) instead of the 
transaction log (1-1.5 GB).
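For anyone else hitting this, here is a minimal sketch of how that workaround 
takes effect, assuming the flag is read as a Java system property (we pass 
-Dzookeeper.forceSnapshotSync=true on the server command line); the class 
below is illustrative, not the real LearnerHandler:

{code:java}
// Hedged sketch of the forceSnapshotSync workaround; names are assumptions.
public class ForceSnapSyncSketch {
    private static final boolean FORCE_SNAP_SYNC =
        Boolean.getBoolean("zookeeper.forceSnapshotSync");

    // Simplified version of the decision the leader makes per follower.
    boolean shouldSendSnapshot(boolean txnLogCoversFollower) {
        if (FORCE_SNAP_SYNC) {
            // Always fall back to a full snapshot transfer (10-50 MB for us)
            // instead of replaying from the 1-1.5 GB transaction log.
            return true;
        }
        return !txnLogCoversFollower;
    }
}
{code}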

 

Technical details:

I took a look at the code and I think I know why this happens.  When a follower 
needs to sync with a leader, LearnerHandler.syncFollower() gets called on the 
leader.  It goes through a big if statement, and at one point it calls 
db.getProposalsFromTxnLog(peerLastZxid, sizeLimit).  That peerLastZxid can be a 
very old zxid if the follower hasn't taken a snapshot in a long time (i.e. has 
a large snapCount), and sizeLimit will generally be 0.33 * snapshot size (in my 
case around 10 MB).
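To make the numbers concrete, here is roughly how I understand that sizeLimit 
to be computed (modeled on my reading of ZKDatabase.calculateTxnLogSizeLimit(); 
the property name and the 0.33 default are my assumptions):

{code:java}
import java.io.File;

// Hedged sketch of the sizeLimit computation; names and defaults are assumptions.
public class SizeLimitSketch {
    // Assumed default factor of 0.33, overridable via a system property.
    private static final double SNAPSHOT_SIZE_FACTOR =
        Double.parseDouble(System.getProperty("zookeeper.snapshotSizeFactor", "0.33"));

    long calculateTxnLogSizeLimit(File mostRecentSnapshot) {
        long snapSize = (mostRecentSnapshot != null) ? mostRecentSnapshot.length() : 0;
        // With our ~30 MB snapshots this works out to roughly 10 MB, which the
        // leader then compares against FileTxnIterator.getStorageSize().
        return (long) (snapSize * SNAPSHOT_SIZE_FACTOR);
    }
}
{code}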

Inside getProposalsFromTxnLog it creates a TxnIterator and then calls 
getStorageSize() on it.  The problem is that this call to getStorageSize() 
usually returns 0.  That happens because the FileTxnIterator class has a 
"current" log file that it is reading, this.logFile, and a list of files it 
still has to iterate through, this.storedFiles.  However, getStorageSize() only 
looks at the storedFiles list, so if the iterator has one large transaction log 
as the "current" log file and nothing in the storedFiles list, the method 
returns 0 even though there is a huge transaction log to sync.
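In simplified form, the behaviour I'm describing looks like this (a stand-in 
for FileTxnLog.FileTxnIterator to illustrate the issue, not the actual class):

{code:java}
import java.io.File;
import java.util.List;

// Simplified stand-in for FileTxnLog.FileTxnIterator, not the actual class.
class FileTxnIteratorSketch {
    File logFile;            // the "current" log file being read
    List<File> storedFiles;  // remaining log files still to be iterated

    // Only sums the files left in storedFiles.  If the one huge log file is
    // already held as logFile and storedFiles is empty, this returns 0 even
    // though there may be 1+ GB of transactions left to stream.
    long getStorageSize() {
        long sum = 0;
        for (File f : storedFiles) {
            sum += f.length();
        }
        return sum;
    }
}
{code}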

One other side effect of this problem is that even bouncing a follower can 
cause long (5-10 second) pauses, because the leader holds a read lock on the 
transaction log while it syncs up the follower.
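My understanding of that locking, sketched below, is that the leader takes the 
log read lock for the whole sync, so anything that needs the corresponding 
write lock stalls behind it (the lock and method names here are assumptions, 
not the exact ZKDatabase/LearnerHandler code):

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of the read-lock interaction; not the real leader code.
class SyncLockSketch {
    private final ReentrantReadWriteLock logLock = new ReentrantReadWriteLock();

    void syncFollowerFromTxnLog() {
        logLock.readLock().lock();
        try {
            // Streaming 1-1.5 GB of transaction log here can take 5-10+
            // seconds, and the read lock is held the whole time, so anything
            // waiting on the write lock is blocked until the sync finishes.
            streamProposalsToFollower();
        } finally {
            logLock.readLock().unlock();
        }
    }

    void streamProposalsToFollower() { /* iterate the txn log, queue packets */ }
}
{code}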

While I know what the problem is, I don't know what the best solution is.  I'm 
willing to work on a fix, but I would appreciate suggestions.  One possible 
solution would be to include this.logFile in the getStorageSize() calculation; 
however, that could overestimate the amount of data in the iterator (possibly 
by a lot), and I don't know what the consequences of that would be.  I'm not 
quite sure what a good way to get an accurate estimate is.
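For illustration only, that option would look something like the sketch below 
(built on the simplified iterator above; a rough sketch of the idea, not a 
tested patch):

{code:java}
import java.io.File;
import java.util.List;

// Rough sketch of the proposed change, not a tested patch.
class FileTxnIteratorFixSketch {
    File logFile;            // the "current" log file being read
    List<File> storedFiles;  // remaining log files still to be iterated

    long getStorageSize() {
        long sum = 0;
        for (File f : storedFiles) {
            sum += f.length();
        }
        // Also count the current file.  This can overestimate (possibly by a
        // lot), since part of logFile may already have been read past.
        if (logFile != null) {
            sum += logFile.length();
        }
        return sum;
    }
}
{code}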

 



