Robert Joseph Evans created ZOOKEEPER-2678:
----------------------------------------------

             Summary: Large databases take a long time to regain a quorum
                 Key: ZOOKEEPER-2678
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2678
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.5.2, 3.4.9
            Reporter: Robert Joseph Evans
            Assignee: Robert Joseph Evans


I know this is long but please here me out.

I recently inherited a massive zookeeper ensemble.  The snapshot is 3.4 GB on 
disk.  Because of its massive size we have been running into a number of 
issues. There are lots of problems that we hope to fix with tuning GC etc, but 
the big one right now that is blocking us making a lot of progress on the rest 
of them is that when we lose a quorum because the leader left, for what ever 
reason, it can take well over 5 mins for a new quorum to be established.  So we 
cannot tune the leader without risking downtime.

We traced down where the time was being spent and found that each server was 
clearing the database so it would be read back in again before leader election 
even started.  Then as part of the sync phase each server will write out a 
snapshot to checkpoint the progress it made as part of the sync.

I will be putting up a patch shortly with some proposed changes in it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to