[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15844165#comment-15844165
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2678:
-------------------------------------------

Github user fpj commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/157#discussion_r98337403
  
    --- Diff: src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java ---
    @@ -839,6 +839,13 @@ public void converseWithFollower(InputArchive ia, OutputArchive oa,
                         Assert.assertEquals(1, f.self.getAcceptedEpoch());
                         Assert.assertEquals(1, f.self.getCurrentEpoch());
                         
    +                    //Wait for the edits to be written out
    --- End diff --
    
    I need to think some more about whether it makes sense to add test cases 
for this. The test cases we already have probably cover this well enough, 
given that there is no real change in behavior.
    
    This change here is necessary, though. We don't really care about time in 
general in our tests because we can never be sure of the timing we will get 
across runs and with different settings.
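
The point about timing can be illustrated with a minimal sketch. This is not the code from the pull request: `WaitForCondition` and `waitFor` are hypothetical names, and the background thread only stands in for whatever writes the edits out. The idea is to poll for a condition up to a generous deadline rather than sleep for a fixed interval and hope the work has finished.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of ZooKeeper or of the patch under review.
public class WaitForCondition {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition became true, false on timeout.
     */
    public static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up: the condition never held in time
            }
            Thread.sleep(pollMs);
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for "the edits have been written out" (an assumption).
        AtomicBoolean written = new AtomicBoolean(false);
        Thread writer = new Thread(() -> {
            try {
                Thread.sleep(50); // simulated work of unknown duration
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            written.set(true);
        });
        writer.start();

        // The test proceeds as soon as the condition holds, however long it took.
        boolean ok = waitFor(written::get, 5_000, 10);
        writer.join();
        System.out.println("edits written: " + ok);
    }
}
```

With this shape the timeout only bounds a failing run, so a slow machine makes the test slower rather than flaky.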


> Large databases take a long time to regain a quorum
> ---------------------------------------------------
>
>                 Key: ZOOKEEPER-2678
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2678
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.9, 3.5.2
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>
> I know this is long but please hear me out.
> I recently inherited a massive zookeeper ensemble.  The snapshot is 3.4 GB on 
> disk.  Because of its massive size we have been running into a number of 
> issues. There are lots of problems that we hope to fix by tuning GC etc., 
> but the big one right now, which is blocking us from making much progress on 
> the rest of them, is that when we lose a quorum because the leader left, for 
> whatever reason, it can take well over 5 minutes for a new quorum to be 
> established. So we cannot tune the leader without risking downtime.
> We traced down where the time was being spent and found that each server was 
> clearing the database so it would be read back in again before leader 
> election even started.  Then as part of the sync phase each server will write 
> out a snapshot to checkpoint the progress it made as part of the sync.
> I will be putting up a patch shortly with some proposed changes in it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
