[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658861#comment-14658861
 ] 

Edward Ribeiro commented on ZOOKEEPER-2240:
-------------------------------------------

Hi [~elyograg], I would like to suggest some text about the Reliability and 
Fault Tolerance Guarantees of ZooKeeper (just a suggestion, feel free to 
change or even ignore the text according to your preference). :)

{quote}
"The ZooKeeper service provides availability and reliability in spite of 
ensemble node failures, but it requires a majority of servers to be up for the 
service to remain available. The number of nodes required to tolerate a given 
number of simultaneous failures can be calculated with the following formula:

N = 2 * F + 1,

where 'N' is the total number of nodes and 'F' is how many ZK nodes can fail 
(or be down) simultaneously. 

For example, for F=1 we have:

N = 2 * 1 + 1 
N = 3

Therefore, a ZK ensemble should comprise at least 3 nodes to tolerate a 
single node failure. If two nodes fail in the above scenario, the ZooKeeper 
service becomes unavailable, because the majority (2 nodes) is lost.

To tolerate 2 node failures, we would have:

N = 2 * 2 + 1
N = 5

That is, even if two servers fail at the same time, the service will still be 
up, because the other 3 nodes form the majority of the ensemble.

Typically 3 servers are sufficient. However, for online production serving 
environments a 5-node ensemble is a good choice: it allows you to take 1 server 
down for scheduled maintenance and still stay up even if one of the remaining 
4 active servers fails unexpectedly.

It is important to note that adding more servers to the ZooKeeper ensemble, 
and thereby increasing the number of quorum participants, makes write 
performance drop, so over-provisioning the number of nodes (for example, 
building an ensemble with more than 11 nodes) is not recommended.
{quote}
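
In case it helps whoever writes the patch, here is a minimal sketch of the 
arithmetic in the suggested text. It is plain Python and does not use any 
ZooKeeper API; the function names (ensemble_size, quorum_size, 
tolerated_failures) are made up for illustration and assume simple majority 
quorums.

```python
def ensemble_size(failures: int) -> int:
    """Smallest ensemble that survives `failures` simultaneous node failures (N = 2 * F + 1)."""
    if failures < 0:
        raise ValueError("failures must be non-negative")
    return 2 * failures + 1


def quorum_size(nodes: int) -> int:
    """Number of nodes that form a strict majority of an ensemble of `nodes` servers."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be down while a majority is still up (the formula solved for F)."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    # Print ensemble size, majority size, and tolerated failures side by side.
    for nodes in range(1, 8):
        print(nodes, quorum_size(nodes), tolerated_failures(nodes))
```

Under these assumptions a 4-node ensemble tolerates the same single failure as 
a 3-node one, which is why the examples in the text use odd sizes.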

> Make the three-node minimum more explicit in documentation and on website
> -------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2240
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2240
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: documentation
>            Reporter: Shawn Heisey
>            Assignee: Shawn Heisey
>            Priority: Trivial
>             Fix For: 3.4.7, 3.5.2, 3.6.0
>
>
> One of the most important parts of a production zookeeper deployment is the 
> three-node minimum requirement for fault tolerance ... but when I glance at 
> the website and the documentation, this requirement is difficult to actually 
> find.
> It is buried deep in the admin documentation, in a sentence that says "Thus, 
> a deployment that consists of three machines can handle one failure, and a 
> deployment of five machines can handle two failures."  Other parts of the 
> documentation hint at it, but nothing that I've seen comes out and explicitly 
> says it.
> Ideally, documentation about this requirement would be in a location where it 
> can be easily pinpointed with a targeted URL, so I can point to ZK 
> documentation with a link and clearly tell SolrCloud users that this is a 
> real requirement.
> If someone can point me to version control locations where I can check out or 
> clone the docs and the website, I'm happy to attempt a patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
