[ https://issues.apache.org/jira/browse/ZOOKEEPER-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15844164#comment-15844164 ]
ASF GitHub Bot commented on ZOOKEEPER-2678:
-------------------------------------------

Github user fpj commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/157#discussion_r98336253

    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/Learner.java ---
    @@ -364,10 +367,12 @@ else if (qp.getType() == Leader.SNAP) {
                 long lastQueued = 0;

    -            // in V1.0 we take a snapshot when we get the NEWLEADER message, but in pre V1.0
    +            // in Zab V1.0 (ZK 3.4+) we might take a snapshot when we get the NEWLEADER message, but in pre V1.0
                 // we take the snapshot at the UPDATE, since V1.0 also gets the UPDATE (after the NEWLEADER)
                 // we need to make sure that we don't take the snapshot twice.
    -            boolean snapshotTaken = false;
    +            boolean isPreZAB1_0 = true;
    +            //If we are not going to take the snapshot be sure the edits are not applied in memory
    +            boolean writeToEditLog = !snapshotNeeded;
    --- End diff --

    The changes here use `edit` to refer to `txns`. I'd rather use `txn` to be
    consistent across the project. Specifically, you're using `EditLog` to refer
    to the `TxnLog`, so please rename accordingly to keep the terminology
    consistent across the project.

> Large databases take a long time to regain a quorum
> ---------------------------------------------------
>
>                 Key: ZOOKEEPER-2678
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2678
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.9, 3.5.2
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>
> I know this is long, but please hear me out.
> I recently inherited a massive ZooKeeper ensemble. The snapshot is 3.4 GB on
> disk. Because of its massive size we have been running into a number of
> issues. There are lots of problems that we hope to fix with tuning GC etc.,
> but the big one right now, which is blocking a lot of progress on the rest of
> them, is that when we lose a quorum because the leader left, for whatever
> reason, it can take well over 5 minutes for a new quorum to be established.
> So we cannot tune the leader without risking downtime.
> We traced down where the time was being spent and found that each server was
> clearing its database so it would be read back in again before leader election
> even started. Then, as part of the sync phase, each server writes out a
> snapshot to checkpoint the progress it made during the sync.
> I will be putting up a patch shortly with some proposed changes.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
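
For illustration, here is a minimal Java sketch of the rename fpj is asking for in the review comment; the wrapper class, the hard-coded snapshotNeeded value, and the printed output are assumptions for the example, not the actual Learner.java patch:

    // A minimal sketch of the requested rename; names and values here are
    // assumptions for illustration, not the real patch.
    public class TxnLogNamingSketch {
        public static void main(String[] args) {
            boolean snapshotNeeded = false; // e.g. the leader only sent a DIFF
            // Keep "txn" terminology so the flag lines up with TxnLog,
            // rather than introducing separate "edit log" wording.
            boolean writeToTxnLog = !snapshotNeeded;
            System.out.println("writeToTxnLog=" + writeToTxnLog);
        }
    }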
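
To make the sync-phase point in the issue description concrete, below is a small, self-contained Java sketch of the decision the patch is aiming for: skip the checkpoint snapshot when the leader only sends a DIFF, and write the missing txns to the txn log instead. SyncMode and snapshotNeeded() are hypothetical names invented for this example, not the real Learner API:

    // Hypothetical sketch of the idea behind the patch: only checkpoint with a
    // snapshot during the sync phase when the leader sends a full SNAP (or a
    // TRUNC); for a DIFF, the existing on-disk snapshot is still valid.
    public class SyncSnapshotSketch {

        enum SyncMode { DIFF, SNAP, TRUNC }

        static boolean snapshotNeeded(SyncMode mode) {
            // Re-writing a multi-GB snapshot on every sync is what delays
            // quorum formation; a DIFF only replays txns the follower missed.
            return mode != SyncMode.DIFF;
        }

        public static void main(String[] args) {
            for (SyncMode mode : SyncMode.values()) {
                System.out.println(mode + " -> snapshotNeeded=" + snapshotNeeded(mode));
            }
        }
    }

The intent, as described in the issue, is that a follower that only receives a DIFF avoids writing out a multi-gigabyte snapshot before the quorum can form.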