Alex Gvozdenovic created ZOOKEEPER-1465:
-------------------------------------------
Summary: Cluster availability following new leader election takes
a long time with large datasets - is correlated to dataset size
Key: ZOOKEEPER-1465
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1465
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection
Affects Versions: 3.4.3
Reporter: Alex Gvozdenovic
Fix For: 3.4.4
When a cluster re-elects a new leader, it takes a long time for the
cluster to become available again if the dataset is large.
Test Data
----------
650 MB snapshot size
20k nodes of varied size
3-member cluster
On 3.4.x branch
(http://svn.apache.org/repos/asf/zookeeper/branches/branch-3.4?r=1244779)
------------------------------------------------------------------------------------------
Takes 3-4 minutes to bring up a cluster from cold
Takes 40-50 secs to recover from a leader failure
Takes 10 secs for a new follower to join the cluster
Using the 3.3.5 release on the same hardware with the same dataset
-----------------------------------------------------------------
Takes 10-20 secs to bring up a cluster from cold
Takes 10 secs to recover from a leader failure
Takes 10 secs for a new follower to join the cluster
I can see from the logs in 3.4.x that once a new leader is elected, it pushes a
new snapshot to each of the followers. Each follower has to save the snapshot
before it acks the leader, and only then can the leader mark the cluster as
available.
The kit being used is a low-spec VM, so the absolute times are not relevant per
se. What matters is that a snapshot is always sent even though there is no
difference between the persisted state on each peer.
No data is being added to the cluster while the peers are being restarted.
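
To make the expectation concrete, here is a minimal sketch of the kind of
decision a leader could make when a follower reconnects: skip the full
snapshot when the follower's last transaction id (zxid) already matches the
leader's. It illustrates the behaviour this report expects; it is not the
actual ZooKeeper code. The function name, parameter names and the
`SyncMode` enum are all hypothetical and chosen only for this example.

```python
from enum import Enum


class SyncMode(Enum):
    """Hypothetical labels for how a leader brings a follower up to date."""
    DIFF = "DIFF"    # send only the missing transactions (possibly none)
    TRUNC = "TRUNC"  # follower is ahead, tell it to truncate its log
    SNAP = "SNAP"    # send a full snapshot of the data tree


def choose_sync(peer_last_zxid, leader_last_zxid,
                min_committed_log, max_committed_log):
    """Pick a sync mode for one follower (illustrative assumption only).

    min_committed_log / max_committed_log stand for the range of recent
    transactions the leader still holds in memory.
    """
    if peer_last_zxid == leader_last_zxid:
        # Identical persisted state: an empty diff is enough.
        return SyncMode.DIFF
    if min_committed_log <= peer_last_zxid <= max_committed_log:
        # Follower is slightly behind: replay the missing transactions.
        return SyncMode.DIFF
    if peer_last_zxid > max_committed_log:
        # Follower has transactions the leader does not know about.
        return SyncMode.TRUNC
    # Follower is too far behind for a diff: fall back to a snapshot.
    return SyncMode.SNAP


# In the scenario above no data is written during the restarts, so every
# peer would report the same zxid and no snapshot transfer would be needed.
print(choose_sync(100, 100, 50, 100))
```

Under these assumptions, the restart scenario described in this report would
always take the first branch, so the 650 MB snapshot would not have to be sent
or saved before the cluster becomes available.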