[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839860#comment-15839860
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2678:
-------------------------------------------

GitHub user revans2 opened a pull request:

    https://github.com/apache/zookeeper/pull/157

    ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DB

    This patch addresses recovery time when a leader is lost on a large DB.  
    
    It does this by not clearing the DB before leader election begins, and by
    avoiding taking a snapshot as part of the SYNC phase, specifically for a
    DIFF sync. Instead of snapshotting, it buffers the proposals and commits,
    just as the code currently does for proposals/commits sent after the
    NEWLEADER and before the UPTODATE messages.
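    
    The buffering idea above can be sketched roughly as follows. This is a
    simplified illustration, not the code in the patch: the class
    DiffSyncBuffer and its method names are hypothetical, transactions are
    reduced to bare zxids, and it assumes commits arrive in the same order as
    their proposals.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch only; names and structure are assumptions,
// not ZooKeeper's actual learner implementation.
class DiffSyncBuffer {
    // Proposals received during the sync that have not been committed yet.
    private final Deque<Long> pendingProposals = new ArrayDeque<>();
    // Committed zxids held back until the leader signals UPTODATE.
    private final List<Long> committed = new ArrayList<>();
    // Zxids that have been applied to the in-memory DB.
    private final List<Long> applied = new ArrayList<>();

    // Buffer a proposal instead of applying it and snapshotting.
    void onProposal(long zxid) {
        pendingProposals.addLast(zxid);
    }

    // A commit must match the oldest buffered proposal (assumed ordering).
    void onCommit(long zxid) {
        Long head = pendingProposals.peekFirst();
        if (head == null || head != zxid) {
            throw new IllegalStateException(
                "commit " + zxid + " does not match oldest pending proposal " + head);
        }
        committed.add(pendingProposals.removeFirst());
    }

    // On UPTODATE, apply everything that was committed while syncing.
    void onUpToDate() {
        applied.addAll(committed);
        committed.clear();
    }

    List<Long> applied() {
        return applied;
    }

    int pendingCount() {
        return pendingProposals.size();
    }
}
```

    In this sketch nothing is applied until onUpToDate() is called, so no
    snapshot is needed to checkpoint partial progress during the sync.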
    
    If a SNAP is sent we cannot avoid writing out the full snapshot, because
    there is no other way to make sure the DB on disk is in sync with what is
    in memory. Otherwise, any edits written to the edit log before a background
    snapshot happened could be applied on top of an incorrect snapshot.
    
    This same optimization should work for TRUNC too, but I opted not to do it
    for TRUNC because TRUNC is rare and, by its very nature, already forces the
    DB to be reread after the edit logs are modified. So it would still not be
    fast.
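    
    The per-sync-type behavior described in the paragraphs above can be
    summarized in a small sketch. The class SyncSnapshotPolicy, the enum, and
    the method name are hypothetical and exist only for illustration; the
    patch itself may structure this decision differently.

```java
// Hypothetical summary of the description above; not code from the patch.
class SyncSnapshotPolicy {
    enum SyncType { DIFF, SNAP, TRUNC }

    // Returns true when a full snapshot is still written during the sync.
    static boolean snapshotDuringSync(SyncType type) {
        switch (type) {
            case DIFF:
                // Proposals and commits are buffered instead.
                return false;
            case SNAP:
                // The on-disk DB must be brought in line with memory.
                return true;
            case TRUNC:
                // Left unoptimized: rare, and the DB is reread anyway.
                return true;
            default:
                throw new IllegalArgumentException("unknown sync type: " + type);
        }
    }
}
```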
    
    In practice, instead of taking 5+ minutes for the cluster to recover from
    losing a leader, it now takes about 3 seconds.
    
    I am happy to port this to 3.5 if it looks good.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/revans2/zookeeper ZOOKEEPER-2678

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/157.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #157
    
----
commit 5aa25620e0189b28d7040305272be2fda28126fb
Author: Robert (Bobby) Evans <ev...@yahoo-inc.com>
Date:   2017-01-19T19:50:32Z

    ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DBs

----


> Large databases take a long time to regain a quorum
> ---------------------------------------------------
>
>                 Key: ZOOKEEPER-2678
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2678
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.9, 3.5.2
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>
> I know this is long, but please hear me out.
> I recently inherited a massive ZooKeeper ensemble.  The snapshot is 3.4 GB on
> disk.  Because of its massive size we have been running into a number of
> issues. There are lots of problems that we hope to fix with tuning GC etc.,
> but the big one right now, which is blocking us from making much progress on
> the rest of them, is that when we lose a quorum because the leader left, for
> whatever reason, it can take well over 5 minutes for a new quorum to be
> established. So we cannot tune the leader without risking downtime.
> We traced down where the time was being spent and found that each server was 
> clearing the database so it would be read back in again before leader 
> election even started.  Then as part of the sync phase each server will write 
> out a snapshot to checkpoint the progress it made as part of the sync.
> I will be putting up a patch shortly with some proposed changes in it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
