http://git-wip-us.apache.org/repos/asf/zookeeper/blob/4607a3e1/zookeeper-docs/src/documentation/content/xdocs/zookeeperHierarchicalQuorums.xml ---------------------------------------------------------------------- diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperHierarchicalQuorums.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperHierarchicalQuorums.xml new file mode 100644 index 0000000..f71c4a8 --- /dev/null +++ b/zookeeper-docs/src/documentation/content/xdocs/zookeeperHierarchicalQuorums.xml @@ -0,0 +1,75 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!-- + Copyright 2002-2004 The Apache Software Foundation + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN" +"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd"> +<article id="zk_HierarchicalQuorums"> + <title>Introduction to hierarchical quorums</title> + + <articleinfo> + <legalnotice> + <para>Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. You may + obtain a copy of the License at <ulink + url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para> + + <para>Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an "AS IS" + BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied. See the License for the specific language governing permissions + and limitations under the License.</para> + </legalnotice> + + <abstract> + <para>This document contains information about hierarchical quorums.</para> + </abstract> + </articleinfo> + + <para> + This document gives an example of how to use hierarchical quorums. The basic idea is + very simple. First, we split servers into groups, and add a line for each group listing + the servers that form this group. Next we have to assign a weight to each server. + </para> + + <para> + The following example shows how to configure a system with three groups of three servers + each, and we assign a weight of 1 to each server: + </para> + + <programlisting> + group.1=1:2:3 + group.2=4:5:6 + group.3=7:8:9 + + weight.1=1 + weight.2=1 + weight.3=1 + weight.4=1 + weight.5=1 + weight.6=1 + weight.7=1 + weight.8=1 + weight.9=1 + </programlisting> + + <para> + When running the system, we are able to form a quorum once we have a majority of votes from + a majority of non-zero-weight groups. Groups that have zero weight are discarded and not + considered when forming quorums. Looking at the example, we are able to form a quorum once + we have votes from at least two servers from each of two different groups. + </para> + </article> \ No newline at end of file
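To make the hierarchical-quorum rule above concrete, here is a minimal, self-contained Java sketch of the check it describes. This is an illustration only, not ZooKeeper's actual quorum verifier: the class name HierarchicalQuorumCheck, its constructor shape, and the default weight of 1 for unlisted servers are assumptions made for this example. An ack set forms a quorum when, in a majority of the non-zero-weight groups, the acknowledged weight exceeds half of that group's total weight.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class HierarchicalQuorumCheck {
    // serverId -> groupId, taken from the group.N lines
    private final Map<Long, Long> serverGroup;
    // serverId -> weight, taken from the weight.N lines (servers without an entry are assumed to have weight 1 here)
    private final Map<Long, Long> serverWeight;

    public HierarchicalQuorumCheck(Map<Long, Long> serverGroup, Map<Long, Long> serverWeight) {
        this.serverGroup = serverGroup;
        this.serverWeight = serverWeight;
    }

    /** Returns true if the servers in ackSet form a hierarchical quorum. */
    public boolean containsQuorum(Set<Long> ackSet) {
        Map<Long, Long> totalWeight = new HashMap<>();   // groupId -> total weight of the group
        Map<Long, Long> ackedWeight = new HashMap<>();   // groupId -> weight acknowledged so far

        for (Map.Entry<Long, Long> e : serverGroup.entrySet()) {
            long w = serverWeight.getOrDefault(e.getKey(), 1L);
            totalWeight.merge(e.getValue(), w, Long::sum);
        }
        for (Long serverId : ackSet) {
            Long groupId = serverGroup.get(serverId);
            if (groupId != null) {
                ackedWeight.merge(groupId, serverWeight.getOrDefault(serverId, 1L), Long::sum);
            }
        }

        int groupsWithMajority = 0;
        int nonZeroGroups = 0;
        for (Map.Entry<Long, Long> e : totalWeight.entrySet()) {
            if (e.getValue() == 0) {
                continue;                                // zero-weight groups are discarded
            }
            nonZeroGroups++;
            long acked = ackedWeight.getOrDefault(e.getKey(), 0L);
            if (2 * acked > e.getValue()) {              // more than half of the group's weight
                groupsWithMajority++;
            }
        }
        // Quorum: a majority of the non-zero-weight groups each reached a group-level majority.
        return 2 * groupsWithMajority > nonZeroGroups;
    }
}

For the nine-server, three-group configuration above with every weight set to 1, any ack set containing at least two servers from each of two groups passes this check, so quorums of size four are possible.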
http://git-wip-us.apache.org/repos/asf/zookeeper/blob/4607a3e1/zookeeper-docs/src/documentation/content/xdocs/zookeeperInternals.xml ---------------------------------------------------------------------- diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperInternals.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperInternals.xml new file mode 100644 index 0000000..7815bc1 --- /dev/null +++ b/zookeeper-docs/src/documentation/content/xdocs/zookeeperInternals.xml @@ -0,0 +1,487 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!-- + Copyright 2002-2004 The Apache Software Foundation + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN" +"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd"> +<article id="ar_ZooKeeperInternals"> + <title>ZooKeeper Internals</title> + + <articleinfo> + <legalnotice> + <para>Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. You may + obtain a copy of the License at <ulink + url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para> + + <para>Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an "AS IS" + BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied. See the License for the specific language governing permissions + and limitations under the License.</para> + </legalnotice> + + <abstract> + <para>This article contains topics which discuss the inner workings of + ZooKeeper. So far, that's logging and atomic broadcast. </para> + + </abstract> + </articleinfo> + + <section id="ch_Introduction"> + <title>Introduction</title> + + <para>This document contains information on the inner workings of ZooKeeper. + So far, it discusses these topics: + </para> + +<itemizedlist> +<listitem><para><xref linkend="sc_atomicBroadcast"/></para></listitem> +<listitem><para><xref linkend="sc_logging"/></para></listitem> +</itemizedlist> + +</section> + +<section id="sc_atomicBroadcast"> +<title>Atomic Broadcast</title> + +<para> +At the heart of ZooKeeper is an atomic messaging system that keeps all of the servers in sync.</para> + +<section id="sc_guaranteesPropertiesDefinitions"><title>Guarantees, Properties, and Definitions</title> +<para> +The specific guarantees provided by the messaging system used by ZooKeeper are the following:</para> + +<variablelist> + +<varlistentry><term><emphasis >Reliable delivery</emphasis></term> +<listitem><para>If a message, m, is delivered +by one server, it will be eventually delivered by all servers.</para></listitem></varlistentry> + +<varlistentry><term><emphasis >Total order</emphasis></term> +<listitem><para> If a message is +delivered before message b by one server, a will be delivered before b by all +servers. 
If a and b are delivered messages, either a will be delivered before b +or b will be delivered before a.</para></listitem></varlistentry> + +<varlistentry><term><emphasis >Causal order</emphasis> </term> + +<listitem><para> +If a message b is sent after a message a has been delivered by the sender of b, +a must be ordered before b. If a sender sends c after sending b, c must be ordered after b. +</para></listitem></varlistentry> + +</variablelist> + + +<para> +The ZooKeeper messaging system also needs to be efficient, reliable, and easy to +implement and maintain. We make heavy use of messaging, so we need the system to +be able to handle thousands of requests per second. Although we can require at +least k+1 correct servers to send new messages, we must be able to recover from +correlated failures such as power outages. When we implemented the system we had +little time and few engineering resources, so we needed a protocol that is +accessible to engineers and is easy to implement. We found that our protocol +satisfied all of these goals. + +</para> + +<para> +Our protocol assumes that we can construct point-to-point FIFO channels between +the servers. While similar services usually assume message delivery that can +lose or reorder messages, our assumption of FIFO channels is very practical +given that we use TCP for communication. Specifically we rely on the following property of TCP:</para> + +<variablelist> + +<varlistentry> +<term><emphasis >Ordered delivery</emphasis></term> +<listitem><para>Data is delivered in the same order it is sent and a message m is +delivered only after all messages sent before m have been delivered. +(The corollary to this is that if message m is lost all messages after m will be lost.)</para></listitem></varlistentry> + +<varlistentry><term><emphasis >No message after close</emphasis></term> +<listitem><para>Once a FIFO channel is closed, no messages will be received from it.</para></listitem></varlistentry> + +</variablelist> + +<para> +FLP proved that consensus cannot be achieved in asynchronous distributed systems +if failures are possible. To ensure we achieve consensus in the presence of failures +we use timeouts. However, we rely on times for liveness not for correctness. So, +if timeouts stop working (clocks malfunction for example) the messaging system may +hang, but it will not violate its guarantees.</para> + +<para>When describing the ZooKeeper messaging protocol we will talk of packets, +proposals, and messages:</para> +<variablelist> +<varlistentry><term><emphasis >Packet</emphasis></term> +<listitem><para>a sequence of bytes sent through a FIFO channel</para></listitem></varlistentry><varlistentry> + +<term><emphasis >Proposal</emphasis></term> +<listitem><para>a unit of agreement. Proposals are agreed upon by exchanging packets +with a quorum of ZooKeeper servers. Most proposals contain messages, however the +NEW_LEADER proposal is an example of a proposal that does not correspond to a message.</para></listitem> +</varlistentry><varlistentry> + +<term><emphasis >Message</emphasis></term> +<listitem><para>a sequence of bytes to be atomically broadcast to all ZooKeeper +servers. A message put into a proposal and agreed upon before it is delivered.</para></listitem> +</varlistentry> + +</variablelist> + +<para> +As stated above, ZooKeeper guarantees a total order of messages, and it also +guarantees a total order of proposals. ZooKeeper exposes the total ordering using +a ZooKeeper transaction id (<emphasis>zxid</emphasis>). 
Every proposal is stamped with a zxid when +it is proposed, and the zxid exactly reflects the total ordering. Proposals are sent to all +ZooKeeper servers and committed when a quorum of them acknowledge the proposal. +If a proposal contains a message, the message will be delivered when the proposal +is committed. Acknowledgement means the server has recorded the proposal to persistent storage. +Our quorums have the requirement that any pair of quorums must have at least one server +in common. We ensure this by requiring that all quorums have size (<emphasis>n/2+1</emphasis>) where +n is the number of servers that make up a ZooKeeper service. +</para> + +<para> +The zxid has two parts: the epoch and a counter. In our implementation the zxid +is a 64-bit number. We use the high order 32-bits for the epoch and the low order +32-bits for the counter. Because it has two parts, we represent the zxid both as a +number and as a pair of integers, (<emphasis>epoch, count</emphasis>). The epoch number represents a +change in leadership. Each time a new leader comes into power it will have its +own epoch number. We have a simple algorithm to assign a unique zxid to a proposal: +the leader simply increments the zxid to obtain a unique zxid for each proposal. +<emphasis>Leadership activation will ensure that only one leader uses a given epoch, so our +simple algorithm guarantees that every proposal will have a unique id.</emphasis> +</para> + +<para> +ZooKeeper messaging consists of two phases:</para> + +<variablelist> +<varlistentry><term><emphasis >Leader activation</emphasis></term> +<listitem><para>In this phase a leader establishes the correct state of the system +and gets ready to start making proposals.</para></listitem> +</varlistentry> + +<varlistentry><term><emphasis >Active messaging</emphasis></term> +<listitem><para>In this phase a leader accepts messages to propose and coordinates message delivery.</para></listitem> +</varlistentry> +</variablelist> + +<para> +ZooKeeper is a holistic protocol. We do not focus on individual proposals, but rather +look at the stream of proposals as a whole. Our strict ordering allows us to do this +efficiently and greatly simplifies our protocol. Leadership activation embodies +this holistic concept. A leader becomes active only when a quorum of followers +(the leader counts as a follower as well; you can always vote for yourself) has synced +up with the leader; that is, they have the same state. This state consists of all of the +proposals that the leader believes have been committed and the proposal to follow +the leader, the NEW_LEADER proposal. (Hopefully you are thinking to +yourself, <emphasis>Does the set of proposals that the leader believes have been committed +include all the proposals that really have been committed?</emphasis> The answer is <emphasis>yes</emphasis>. +Below, we make clear why.) +</para> + +</section> + +<section id="sc_leaderElection"> + +<title>Leader Activation</title> +<para> +Leader activation includes leader election. We currently have two leader election +algorithms in ZooKeeper: LeaderElection and FastLeaderElection (AuthFastLeaderElection +is a variant of FastLeaderElection that uses UDP and allows servers to perform a simple +form of authentication to avoid IP spoofing).
ZooKeeper messaging doesn't care about the +exact method of electing a leader as long as the following holds: +</para> + +<itemizedlist> + +<listitem><para>The leader has seen the highest zxid of all the followers.</para></listitem> +<listitem><para>A quorum of servers have committed to following the leader.</para></listitem> + +</itemizedlist> + +<para> +Of these two requirements only the first, the highest zxid among the followers, +needs to hold for correct operation. The second requirement, a quorum of followers, +just needs to hold with high probability. We are going to recheck the second requirement, +so if a failure happens during or after the leader election and quorum is lost, +we will recover by abandoning leader activation and running another election. +</para> + +<para> +After leader election a single server will be designated as a leader and start +waiting for followers to connect. The rest of the servers will try to connect to +the leader. The leader will sync up with followers by sending any proposals they +are missing, or if a follower is missing too many proposals, it will send a full +snapshot of the state to the follower. +</para> + +<para> +There is a corner case in which a follower arrives that has proposals, U, not seen +by the leader. Proposals are seen in order, so the proposals of U will have zxids +higher than the zxids seen by the leader. The follower must have arrived after the +leader election, otherwise the follower would have been elected leader given that +it has seen a higher zxid. Since committed proposals must be seen by a quorum of +servers, and a quorum of servers that elected the leader did not see U, the proposals +of U have not been committed, so they can be discarded. When the follower connects +to the leader, the leader will tell the follower to discard U. +</para> + +<para> +A new leader establishes a zxid to start using for new proposals by getting the +epoch, e, of the highest zxid it has seen and setting the next zxid to use to be +(e+1, 0). After the leader syncs with a follower, it will propose a NEW_LEADER +proposal. Once the NEW_LEADER proposal has been committed, the leader will activate +and start receiving and issuing proposals. +</para> + +<para> +It all sounds complicated but here are the basic rules of operation during leader +activation: +</para> + +<itemizedlist> +<listitem><para>A follower will ACK the NEW_LEADER proposal after it has synced with the leader.</para></listitem> +<listitem><para>A follower will only ACK a NEW_LEADER proposal with a given zxid from a single server.</para></listitem> +<listitem><para>A new leader will COMMIT the NEW_LEADER proposal when a quorum of followers have ACKed it.</para></listitem> +<listitem><para>A follower will commit any state it received from the leader when the NEW_LEADER proposal is COMMITTED.</para></listitem> +<listitem><para>A new leader will not accept new proposals until the NEW_LEADER proposal has been COMMITTED.</para></listitem> +</itemizedlist> + +<para> +If leader election terminates erroneously, we don't have a problem: the +NEW_LEADER proposal will not be committed, since the leader will not have a quorum. +When this happens, the leader and any remaining followers will time out and go back +to leader election. +</para> + +</section> + +<section id="sc_activeMessaging"> +<title>Active Messaging</title> +<para> +Leader Activation does all the heavy lifting. Once the leader is coronated he can +start blasting out proposals.
As long as he remains the leader no other leader can +emerge, since no other leader will be able to get a quorum of followers. If a new +leader does emerge, +it means that the old leader has lost its quorum, and the new leader will clean up any +mess left over during her leadership activation. +</para> + +<para>ZooKeeper messaging operates similarly to a classic two-phase commit.</para> + +<mediaobject id="fg_2phaseCommit" > + <imageobject> + <imagedata fileref="images/2pc.jpg"/> + </imageobject> +</mediaobject> + +<para> +All communication channels are FIFO, so everything is done in order. Specifically, +the following operating constraints are observed:</para> + +<itemizedlist> + +<listitem><para>The leader sends proposals to all followers in +the same order. Moreover, this order follows the order in which requests have been +received. Because we use FIFO channels this means that followers also receive proposals in order. +</para></listitem> + +<listitem><para>Followers process messages in the order they are received. This +means that messages will be ACKed in order and the leader will receive ACKs from +followers in order, due to the FIFO channels. It also means that if message $m$ +has been written to non-volatile storage, all messages that were proposed before +$m$ have been written to non-volatile storage.</para></listitem> + +<listitem><para>The leader will issue a COMMIT to all followers as soon as a +quorum of followers have ACKed a message. Since messages are ACKed in order, +COMMITs will be sent by the leader, and received by the followers, in order.</para></listitem> + +<listitem><para>COMMITs are processed in order. Followers deliver a proposal's +message when that proposal is committed.</para></listitem> + +</itemizedlist> + +</section> + +<section id="sc_summary"> +<title>Summary</title> +<para>So there you go. Why does it work? Specifically, why does the set of proposals +believed committed by a new leader always contain every proposal that has actually been committed? +First, all proposals have a unique zxid, so unlike other protocols, we never have +to worry about two different values being proposed for the same zxid; followers +(a leader is also a follower) see and record proposals in order; proposals are +committed in order; there is only one active leader at a time since followers only +follow a single leader at a time; a new leader has seen all committed proposals +from the previous epoch since it has seen the highest zxid from a quorum of servers; +any uncommitted proposals from a previous epoch seen by a new leader will be committed +by that leader before it becomes active.</para></section> + +<section id="sc_comparisons"><title>Comparisons</title> +<para> +Isn't this just Multi-Paxos? No, Multi-Paxos requires some way of assuring that +there is only a single coordinator. We do not count on such assurances. Instead +we use the leader activation to recover from leadership change or old leaders +believing they are still active. +</para> + +<para> +Isn't this just Paxos? Doesn't the active messaging phase look just like phase 2 of Paxos? +Actually, to us active messaging looks just like two-phase commit without the need to +handle aborts. Active messaging is different from both in the sense that it has +cross-proposal ordering requirements. If we do not maintain strict FIFO ordering of +all packets, it all falls apart. Also, our leader activation phase is different from +both of them.
In particular, our use of epochs allows us to skip blocks of uncommitted +proposals and to not worry about duplicate proposals for a given zxid. +</para> + +</section> + +</section> + +<section id="sc_quorum"> +<title>Quorums</title> + +<para> +Atomic broadcast and leader election use the notion of quorum to guarantee a consistent +view of the system. By default, ZooKeeper uses majority quorums, which means that every +voting that happens in one of these protocols requires a majority to vote on. One example is +acknowledging a leader proposal: the leader can only commit once it receives an +acknowledgement from a quorum of servers. +</para> + +<para> +If we extract the properties that we really need from our use of majorities, we have that we only +need to guarantee that groups of processes used to validate an operation by voting (e.g., acknowledging +a leader proposal) pairwise intersect in at least one server. Using majorities guarantees such a property. +However, there are other ways of constructing quorums different from majorities. For example, we can assign +weights to the votes of servers, and say that the votes of some servers are more important. To obtain a quorum, +we get enough votes so that the sum of weights of all votes is larger than half of the total sum of all weights. +</para> + +<para> +A different construction that uses weights and is useful in wide-area deployments (co-locations) is a hierarchical +one. With this construction, we split the servers into disjoint groups and assign weights to processes. To form +a quorum, we have to get a hold of enough servers from a majority of groups G, such that for each group g in G, +the sum of votes from g is larger than half of the sum of weights in g. Interestingly, this construction enables +smaller quorums. If we have, for example, 9 servers, we split them into 3 groups, and assign a weight of 1 to each +server, then we are able to form quorums of size 4. Note that two subsets of processes composed each of a majority +of servers from each of a majority of groups necessarily have a non-empty intersection. It is reasonable to expect +that a majority of co-locations will have a majority of servers available with high probability. +</para> + +<para> +With ZooKeeper, we provide a user with the ability of configuring servers to use majority quorums, weights, or a +hierarchy of groups. +</para> +</section> + +<section id="sc_logging"> + +<title>Logging</title> +<para> +Zookeeper uses +<ulink url="http://www.slf4j.org/index.html">slf4j</ulink> as an abstraction layer for logging. +<ulink url="http://logging.apache.org/log4j">log4j</ulink> in version 1.2 is chosen as the final logging implementation for now. +For better embedding support, it is planned in the future to leave the decision of choosing the final logging implementation to the end user. +Therefore, always use the slf4j api to write log statements in the code, but configure log4j for how to log at runtime. +Note that slf4j has no FATAL level, former messages at FATAL level have been moved to ERROR level. +For information on configuring log4j for +ZooKeeper, see the <ulink url="zookeeperAdmin.html#sc_logging">Logging</ulink> section +of the <ulink url="zookeeperAdmin.html">ZooKeeper Administrator's Guide.</ulink> + +</para> + +<section id="sc_developerGuidelines"><title>Developer Guidelines</title> + +<para>Please follow the +<ulink url="http://www.slf4j.org/manual.html">slf4j manual</ulink> when creating log statements within code. 
+Also read the +<ulink url="http://www.slf4j.org/faq.html#logging_performance">FAQ on performance</ulink> +when creating log statements. Patch reviewers will look for the following:</para> +<section id="sc_rightLevel"><title>Logging at the Right Level</title> +<para> +There are several levels of logging in slf4j. +It's important to pick the right one. In order from highest to lowest severity:</para> +<orderedlist> + <listitem><para>ERROR level designates error events that might still allow the application to continue running.</para></listitem> + <listitem><para>WARN level designates potentially harmful situations.</para></listitem> + <listitem><para>INFO level designates informational messages that highlight the progress of the application at a coarse-grained level.</para></listitem> + <listitem><para>DEBUG level designates fine-grained informational events that are most useful to debug an application.</para></listitem> + <listitem><para>TRACE level designates finer-grained informational events than DEBUG.</para></listitem> +</orderedlist> + +<para> +ZooKeeper is typically run in production such that log messages of INFO level +severity and higher (more severe) are output to the log.</para> + + +</section> + +<section id="sc_slf4jIdioms"><title>Use of Standard slf4j Idioms</title> + +<para><emphasis>Static Message Logging</emphasis></para> +<programlisting> +LOG.debug("process completed successfully!"); +</programlisting> + +<para> +However, when parameterized messages are required, use formatting anchors. +</para> + +<programlisting> +LOG.debug("got {} messages in {} minutes", count, time); +</programlisting> + + +<para><emphasis>Naming</emphasis></para> + +<para> +Loggers should be named after the class in which they are used. +</para> + +<programlisting> +public class Foo { + private static final Logger LOG = LoggerFactory.getLogger(Foo.class); + .... + public Foo() { + LOG.info("constructing Foo"); + } +} +</programlisting> + +<para><emphasis>Exception handling</emphasis></para> +<programlisting> +try { + // code +} catch (XYZException e) { + // do this + LOG.error("Something bad happened", e); + // don't do this (generally) + // LOG.error(e); + // why? because "don't do" case hides the stack trace + + // continue process here as you need... recover or (re)throw +} +</programlisting> +</section> +</section> + +</section> + +</article> http://git-wip-us.apache.org/repos/asf/zookeeper/blob/4607a3e1/zookeeper-docs/src/documentation/content/xdocs/zookeeperJMX.xml ---------------------------------------------------------------------- diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperJMX.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperJMX.xml new file mode 100644 index 0000000..f0ea636 --- /dev/null +++ b/zookeeper-docs/src/documentation/content/xdocs/zookeeperJMX.xml @@ -0,0 +1,236 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!-- + Copyright 2002-2004 The Apache Software Foundation + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License.
+--> + +<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN" +"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd"> +<article id="bk_zookeeperjmx"> + <title>ZooKeeper JMX</title> + + <articleinfo> + <legalnotice> + <para>Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. You may + obtain a copy of the License at <ulink + url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para> + + <para>Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an "AS IS" + BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied. See the License for the specific language governing permissions + and limitations under the License.</para> + </legalnotice> + + <abstract> + <para>ZooKeeper support for JMX</para> + </abstract> + </articleinfo> + + <section id="ch_jmx"> + <title>JMX</title> + <para>Apache ZooKeeper has extensive support for JMX, allowing you + to view and manage a ZooKeeper serving ensemble.</para> + + <para>This document assumes that you have basic knowledge of + JMX. See <ulink + url="http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/"> + Sun JMX Technology</ulink> page to get started with JMX. + </para> + + <para>See the <ulink + url="http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html"> + JMX Management Guide</ulink> for details on setting up local and + remote management of VM instances. By default the included + <emphasis>zkServer.sh</emphasis> supports only local management - + review the linked document to enable support for remote management + (beyond the scope of this document). + </para> + + </section> + + <section id="ch_starting"> + <title>Starting ZooKeeper with JMX enabled</title> + + <para>The class + <emphasis>org.apache.zookeeper.server.quorum.QuorumPeerMain</emphasis> + will start a JMX manageable ZooKeeper server. This class + registers the proper MBeans during initalization to support JMX + monitoring and management of the + instance. See <emphasis>bin/zkServer.sh</emphasis> for one + example of starting ZooKeeper using QuorumPeerMain.</para> + </section> + + <section id="ch_console"> + <title>Run a JMX console</title> + + <para>There are a number of JMX consoles available which can connect + to the running server. For this example we will use Sun's + <emphasis>jconsole</emphasis>.</para> + + <para>The Java JDK ships with a simple JMX console + named <ulink url="http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html">jconsole</ulink> + which can be used to connect to ZooKeeper and inspect a running + server. Once you've started ZooKeeper using QuorumPeerMain + start <emphasis>jconsole</emphasis>, which typically resides in + <emphasis>JDK_HOME/bin/jconsole</emphasis></para> + + <para>When the "new connection" window is displayed either connect + to local process (if jconsole started on same host as Server) or + use the remote process connection.</para> + + <para>By default the "overview" tab for the VM is displayed (this + is a great way to get insight into the VM btw). Select + the "MBeans" tab.</para> + + <para>You should now see <emphasis>org.apache.ZooKeeperService</emphasis> + on the left hand side. 
Expand this item and depending on how you've + started the server you will be able to monitor and manage various + service related features.</para> + + <para>Also note that ZooKeeper will register log4j MBeans as + well. In the same section along the left hand side you will see + "log4j". Expand that to manage log4j through JMX. Of particular + interest is the ability to dynamically change the logging levels + used by editing the appender and root thresholds. Log4j MBean + registration can be disabled by passing + <emphasis>-Dzookeeper.jmx.log4j.disable=true</emphasis> to the JVM + when starting ZooKeeper. + </para> + + </section> + + <section id="ch_reference"> + <title>ZooKeeper MBean Reference</title> + + <para>This table details JMX for a server participating in a + replicated ZooKeeper ensemble (ie not standalone). This is the + typical case for a production environment.</para> + + <table> + <title>MBeans, their names and description</title> + + <tgroup cols='4'> + <thead> + <row> + <entry>MBean</entry> + <entry>MBean Object Name</entry> + <entry>Description</entry> + </row> + </thead> + <tbody> + <row> + <entry>Quorum</entry> + <entry>ReplicatedServer_id<#></entry> + <entry>Represents the Quorum, or Ensemble - parent of all + cluster members. Note that the object name includes the + "myid" of the server (name suffix) that your JMX agent has + connected to.</entry> + </row> + <row> + <entry>LocalPeer|RemotePeer</entry> + <entry>replica.<#></entry> + <entry>Represents a local or remote peer (ie server + participating in the ensemble). Note that the object name + includes the "myid" of the server (name suffix).</entry> + </row> + <row> + <entry>LeaderElection</entry> + <entry>LeaderElection</entry> + <entry>Represents a ZooKeeper cluster leader election which is + in progress. Provides information about the election, such as + when it started.</entry> + </row> + <row> + <entry>Leader</entry> + <entry>Leader</entry> + <entry>Indicates that the parent replica is the leader and + provides attributes/operations for that server. Note that + Leader is a subclass of ZooKeeperServer, so it provides + all of the information normally associated with a + ZooKeeperServer node.</entry> + </row> + <row> + <entry>Follower</entry> + <entry>Follower</entry> + <entry>Indicates that the parent replica is a follower and + provides attributes/operations for that server. Note that + Follower is a subclass of ZooKeeperServer, so it provides + all of the information normally associated with a + ZooKeeperServer node.</entry> + </row> + <row> + <entry>DataTree</entry> + <entry>InMemoryDataTree</entry> + <entry>Statistics on the in memory znode database, also + operations to access finer (and more computationally + intensive) statistics on the data (such as ephemeral + count). InMemoryDataTrees are children of ZooKeeperServer + nodes.</entry> + </row> + <row> + <entry>ServerCnxn</entry> + <entry><session_id></entry> + <entry>Statistics on each client connection, also + operations on those connections (such as + termination). Note the object name is the session id of + the connection in hex form.</entry> + </row> + </tbody></tgroup></table> + + <para>This table details JMX for a standalone server. 
Typically + standalone is only used in development situations.</para> + + <table> + <title>MBeans, their names and description</title> + + <tgroup cols='4'> + <thead> + <row> + <entry>MBean</entry> + <entry>MBean Object Name</entry> + <entry>Description</entry> + </row> + </thead> + <tbody> + <row> + <entry>ZooKeeperServer</entry> + <entry>StandaloneServer_port<#></entry> + <entry>Statistics on the running server, also operations + to reset these attributes. Note that the object name + includes the client port of the server (name + suffix).</entry> + </row> + <row> + <entry>DataTree</entry> + <entry>InMemoryDataTree</entry> + <entry>Statistics on the in memory znode database, also + operations to access finer (and more computationally + intensive) statistics on the data (such as ephemeral + count).</entry> + </row> + <row> + <entry>ServerCnxn</entry> + <entry><session_id></entry> + <entry>Statistics on each client connection, also + operations on those connections (such as + termination). Note the object name is the session id of + the connection in hex form.</entry> + </row> + </tbody></tgroup></table> + + </section> + +</article> http://git-wip-us.apache.org/repos/asf/zookeeper/blob/4607a3e1/zookeeper-docs/src/documentation/content/xdocs/zookeeperObservers.xml ---------------------------------------------------------------------- diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperObservers.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperObservers.xml new file mode 100644 index 0000000..fab6769 --- /dev/null +++ b/zookeeper-docs/src/documentation/content/xdocs/zookeeperObservers.xml @@ -0,0 +1,145 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!-- + Copyright 2002-2004 The Apache Software Foundation + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN" +"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd"> +<article id="bk_GettStartedGuide"> + <title>ZooKeeper Observers</title> + + <articleinfo> + <legalnotice> + <para>Licensed under the Apache License, Version 2.0 (the "License"); you + may not use this file except in compliance with the License. You may + obtain a copy of the License + at <ulink url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para> + + <para>Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the + License for the specific language governing permissions and limitations + under the License.</para> + </legalnotice> + + <abstract> + <para>This guide contains information about using non-voting servers, or + observers, in your ZooKeeper ensembles.</para> + </abstract> + </articleinfo> + + <section id="ch_Introduction"> + <title>Observers: Scaling ZooKeeper Without Hurting Write Performance + </title> + <para> + Although ZooKeeper performs very well by having clients connect directly + to voting members of the ensemble, this architecture makes it hard to + scale out to huge numbers of clients. The problem is that as we add more + voting members, the write performance drops. This is due to the fact that + a write operation requires the agreement of (in general) a majority of the + nodes in an ensemble and therefore the cost of a vote can increase + significantly as more voters are added. + </para> + <para> + We have introduced a new type of ZooKeeper node called + an <emphasis>Observer</emphasis> which helps address this problem and + further improves ZooKeeper's scalability. Observers are non-voting members + of an ensemble which only hear the results of votes, not the agreement + protocol that leads up to them. Other than this simple distinction, + Observers function exactly the same as Followers - clients may connect to + them and send read and write requests to them. Observers forward these + requests to the Leader like Followers do, but they then simply wait to + hear the result of the vote. Because of this, we can increase the number + of Observers as much as we like without harming the performance of votes. + </para> + <para> + Observers have other advantages. Because they do not vote, they are not a + critical part of the ZooKeeper ensemble. Therefore they can fail, or be + disconnected from the cluster, without harming the availability of the + ZooKeeper service. The benefit to the user is that Observers may connect + over less reliable network links than Followers. In fact, Observers may be + used to talk to a ZooKeeper server from another data center. Clients of + the Observer will see fast reads, as all reads are served locally, and + writes result in minimal network traffic as the number of messages + required in the absence of the vote protocol is smaller. + </para> + </section> + <section id="sc_UsingObservers"> + <title>How to use Observers</title> + <para>Setting up a ZooKeeper ensemble that uses Observers is very simple, + and requires just two changes to your config files. Firstly, in the config + file of every node that is to be an Observer, you must place this line: + </para> + <programlisting> + peerType=observer + </programlisting> + + <para> + This line tells ZooKeeper that the server is to be an Observer. Secondly, + in every server config file, you must add :observer to the server + definition line of each Observer. For example: + </para> + + <programlisting> + server.1=localhost:2181:3181:observer + </programlisting> + + <para> + This tells every other server that server.1 is an Observer, and that they + should not expect it to vote. This is all the configuration you need to do + to add an Observer to your ZooKeeper cluster. Now you can connect to it as + though it were an ordinary Follower. Try it out by running:</para> + <programlisting> + $ bin/zkCli.sh -server localhost:2181 + </programlisting> + <para> + where localhost:2181 is the hostname and port number of the Observer as + specified in every config file.
You should see a command line prompt + through which you can issue commands like <emphasis>ls</emphasis> to query + the ZooKeeper service. + </para> + </section> + + <section id="ch_UseCases"> + <title>Example use cases</title> + <para> + Two example use cases for Observers are listed below. In fact, wherever + you wish to scale the number of clients of your ZooKeeper ensemble, or + where you wish to insulate the critical part of an ensemble from the load + of dealing with client requests, Observers are a good architectural + choice. + </para> + <itemizedlist> + <listitem> + <para> As a datacenter bridge: Forming a ZK ensemble between two + datacenters is a problematic endeavour as the high variance in latency + between the datacenters could lead to false positive failure detection + and partitioning. However if the ensemble runs entirely in one + datacenter, and the second datacenter runs only Observers, partitions + aren't problematic as the ensemble remains connected. Clients of the + Observers may still see and issue proposals.</para> + </listitem> + <listitem> + <para>As a link to a message bus: Some companies have expressed an + interest in using ZK as a component of a persistent reliable message + bus. Observers would give a natural integration point for this work: a + plug-in mechanism could be used to attach the stream of proposals an + Observer sees to a publish-subscribe system, again without loading the + core ensemble. + </para> + </listitem> + </itemizedlist> + </section> +</article> http://git-wip-us.apache.org/repos/asf/zookeeper/blob/4607a3e1/zookeeper-docs/src/documentation/content/xdocs/zookeeperOtherInfo.xml ---------------------------------------------------------------------- diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperOtherInfo.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperOtherInfo.xml new file mode 100644 index 0000000..a2445b1 --- /dev/null +++ b/zookeeper-docs/src/documentation/content/xdocs/zookeeperOtherInfo.xml @@ -0,0 +1,46 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!-- + Copyright 2002-2004 The Apache Software Foundation + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN" +"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd"> +<article id="bk_OtherInfo"> + <title>ZooKeeper</title> + + <articleinfo> + <legalnotice> + <para>Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. You may + obtain a copy of the License at <ulink + url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para> + + <para>Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an "AS IS" + BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied. 
See the License for the specific language governing permissions + and limitations under the License.</para> + </legalnotice> + + <abstract> + <para> currently empty </para> + </abstract> + </articleinfo> + + <section id="ch_placeholder"> + <title>Other Info</title> + <para> currently empty </para> + </section> +</article> http://git-wip-us.apache.org/repos/asf/zookeeper/blob/4607a3e1/zookeeper-docs/src/documentation/content/xdocs/zookeeperOver.xml ---------------------------------------------------------------------- diff --git a/zookeeper-docs/src/documentation/content/xdocs/zookeeperOver.xml b/zookeeper-docs/src/documentation/content/xdocs/zookeeperOver.xml new file mode 100644 index 0000000..f972657 --- /dev/null +++ b/zookeeper-docs/src/documentation/content/xdocs/zookeeperOver.xml @@ -0,0 +1,464 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!-- + Copyright 2002-2004 The Apache Software Foundation + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +<!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN" +"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd"> +<article id="bk_Overview"> + <title>ZooKeeper</title> + + <articleinfo> + <legalnotice> + <para>Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. You may + obtain a copy of the License at <ulink + url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para> + + <para>Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an "AS IS" + BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied. See the License for the specific language governing permissions + and limitations under the License.</para> + </legalnotice> + + <abstract> + <para>This document contains overview information about ZooKeeper. It + discusses design goals, key concepts, implementation, and + performance.</para> + </abstract> + </articleinfo> + + <section id="ch_DesignOverview"> + <title>ZooKeeper: A Distributed Coordination Service for Distributed + Applications</title> + + <para>ZooKeeper is a distributed, open-source coordination service for + distributed applications. It exposes a simple set of primitives that + distributed applications can build upon to implement higher level services + for synchronization, configuration maintenance, and groups and naming. It + is designed to be easy to program to, and uses a data model styled after + the familiar directory tree structure of file systems. It runs in Java and + has bindings for both Java and C.</para> + + <para>Coordination services are notoriously hard to get right. They are + especially prone to errors such as race conditions and deadlock. 
The + motivation behind ZooKeeper is to relieve distributed applications of the + responsibility of implementing coordination services from scratch.</para> + + <section id="sc_designGoals"> + <title>Design Goals</title> + + <para><emphasis role="bold">ZooKeeper is simple.</emphasis> ZooKeeper + allows distributed processes to coordinate with each other through a + shared hierarchical namespace which is organized similarly to a standard + file system. The name space consists of data registers - called znodes, + in ZooKeeper parlance - and these are similar to files and directories. + Unlike a typical file system, which is designed for storage, ZooKeeper + data is kept in-memory, which means ZooKeeper can achieve high + throughput and low latency numbers.</para> + + <para>The ZooKeeper implementation puts a premium on high performance, + highly available, strictly ordered access. The performance aspects of + ZooKeeper mean it can be used in large, distributed systems. The + reliability aspects keep it from being a single point of failure. The + strict ordering means that sophisticated synchronization primitives can + be implemented at the client.</para> + + <para><emphasis role="bold">ZooKeeper is replicated.</emphasis> Like the + distributed processes it coordinates, ZooKeeper itself is intended to be + replicated over a set of hosts called an ensemble.</para> + + <figure> + <title>ZooKeeper Service</title> + + <mediaobject> + <imageobject> + <imagedata fileref="images/zkservice.jpg" /> + </imageobject> + </mediaobject> + </figure> + + <para>The servers that make up the ZooKeeper service must all know about + each other. They maintain an in-memory image of state, along with + transaction logs and snapshots in a persistent store. As long as a + majority of the servers are available, the ZooKeeper service will be + available.</para> + + <para>Clients connect to a single ZooKeeper server. The client maintains + a TCP connection through which it sends requests, gets responses, gets + watch events, and sends heartbeats. If the TCP connection to the server + breaks, the client will connect to a different server.</para> + + <para><emphasis role="bold">ZooKeeper is ordered.</emphasis> ZooKeeper + stamps each update with a number that reflects the order of all + ZooKeeper transactions. Subsequent operations can use the order to + implement higher-level abstractions, such as synchronization + primitives.</para> + + <para><emphasis role="bold">ZooKeeper is fast.</emphasis> It is + especially fast in "read-dominant" workloads. ZooKeeper applications run + on thousands of machines, and it performs best where reads are more + common than writes, at ratios of around 10:1.</para> + </section> + + <section id="sc_dataModelNameSpace"> + <title>Data model and the hierarchical namespace</title> + + <para>The name space provided by ZooKeeper is much like that of a + standard file system. A name is a sequence of path elements separated by + a slash (/). Every node in ZooKeeper's name space is identified by a + path.</para> + + <figure> + <title>ZooKeeper's Hierarchical Namespace</title> + + <mediaobject> + <imageobject> + <imagedata fileref="images/zknamespace.jpg" /> + </imageobject> + </mediaobject> + </figure> + </section> + + <section> + <title>Nodes and ephemeral nodes</title> + + <para>Unlike standard file systems, each node in a ZooKeeper + namespace can have data associated with it as well as children. It is + like having a file-system that allows a file to also be a directory.
+ (ZooKeeper was designed to store coordination data: status information, + configuration, location information, etc., so the data stored at each + node is usually small, in the byte to kilobyte range.) We use the term + <emphasis>znode</emphasis> to make it clear that we are talking about + ZooKeeper data nodes.</para> + + <para>Znodes maintain a stat structure that includes version numbers for + data changes, ACL changes, and timestamps, to allow cache validations + and coordinated updates. Each time a znode's data changes, the version + number increases. For instance, whenever a client retrieves data it also + receives the version of the data.</para> + + <para>The data stored at each znode in a namespace is read and written + atomically. Reads get all the data bytes associated with a znode and a + write replaces all the data. Each node has an Access Control List (ACL) + that restricts who can do what.</para> + + <para>ZooKeeper also has the notion of ephemeral nodes. These znodes + exist as long as the session that created the znode is active. When the + session ends the znode is deleted. Ephemeral nodes are useful when you + want to implement <emphasis>[tbd]</emphasis>.</para> + </section> + + <section> + <title>Conditional updates and watches</title> + + <para>ZooKeeper supports the concept of <emphasis>watches</emphasis>. + Clients can set a watch on a znode. A watch will be triggered and + removed when the znode changes. When a watch is triggered, the client + receives a packet saying that the znode has changed. If the + connection between the client and one of the ZooKeeper servers is + broken, the client will receive a local notification. These can be used + to <emphasis>[tbd]</emphasis>.</para> + </section> + + <section> + <title>Guarantees</title> + + <para>ZooKeeper is very fast and very simple. Since its goal, though, is + to be a basis for the construction of more complicated services, such as + synchronization, it provides a set of guarantees. These are:</para> + + <itemizedlist> + <listitem> + <para>Sequential Consistency - Updates from a client will be applied + in the order that they were sent.</para> + </listitem> + + <listitem> + <para>Atomicity - Updates either succeed or fail. No partial + results.</para> + </listitem> + + <listitem> + <para>Single System Image - A client will see the same view of the + service regardless of the server that it connects to.</para> + </listitem> + + <listitem> + <para>Reliability - Once an update has been applied, it will persist + from that time forward until a client overwrites the update.</para> + </listitem> + + <listitem> + <para>Timeliness - The client's view of the system is guaranteed to + be up-to-date within a certain time bound.</para> + </listitem> + </itemizedlist> + + <para>For more information on these, and how they can be used, see + <emphasis>[tbd]</emphasis></para> + </section> + + <section> + <title>Simple API</title> + + <para>One of the design goals of ZooKeeper is to provide a very simple + programming interface.
As a result, it supports only these + operations:</para> + + <variablelist> + <varlistentry> + <term>create</term> + + <listitem> + <para>creates a node at a location in the tree</para> + </listitem> + </varlistentry> + + <varlistentry> + <term>delete</term> + + <listitem> + <para>deletes a node</para> + </listitem> + </varlistentry> + + <varlistentry> + <term>exists</term> + + <listitem> + <para>tests if a node exists at a location</para> + </listitem> + </varlistentry> + + <varlistentry> + <term>get data</term> + + <listitem> + <para>reads the data from a node</para> + </listitem> + </varlistentry> + + <varlistentry> + <term>set data</term> + + <listitem> + <para>writes data to a node</para> + </listitem> + </varlistentry> + + <varlistentry> + <term>get children</term> + + <listitem> + <para>retrieves a list of children of a node</para> + </listitem> + </varlistentry> + + <varlistentry> + <term>sync</term> + + <listitem> + <para>waits for data to be propagated</para> + </listitem> + </varlistentry> + </variablelist> + + <para>For a more in-depth discussion on these, and how they can be used + to implement higher level operations, please refer to + <emphasis>[tbd]</emphasis></para> + </section> + + <section> + <title>Implementation</title> + + <para><xref linkend="fg_zkComponents" /> shows the high-level components + of the ZooKeeper service. With the exception of the request processor, + each of the servers that make up the ZooKeeper service replicates its own copy + of each of the components.</para> + + <figure id="fg_zkComponents"> + <title>ZooKeeper Components</title> + + <mediaobject> + <imageobject> + <imagedata fileref="images/zkcomponents.jpg" /> + </imageobject> + </mediaobject> + </figure> + + <para>The replicated database is an in-memory database containing the + entire data tree. Updates are logged to disk for recoverability, and + writes are serialized to disk before they are applied to the in-memory + database.</para> + + <para>Every ZooKeeper server services clients. Clients connect to + exactly one server to submit requests. Read requests are serviced from + each server's local replica of the database. Requests that change the + state of the service, write requests, are processed by an agreement + protocol.</para> + + <para>As part of the agreement protocol all write requests from clients + are forwarded to a single server, called the + <emphasis>leader</emphasis>. The rest of the ZooKeeper servers, called + <emphasis>followers</emphasis>, receive message proposals from the + leader and agree upon message delivery. The messaging layer takes care + of replacing leaders on failures and syncing followers with + leaders.</para> + + <para>ZooKeeper uses a custom atomic messaging protocol. Since the + messaging layer is atomic, ZooKeeper can guarantee that the local + replicas never diverge. When the leader receives a write request, it + calculates what the state of the system will be when the write is + applied and transforms this into a transaction that captures this new + state.</para> + </section> + + <section> + <title>Uses</title> + + <para>The programming interface to ZooKeeper is deliberately simple. + With it, however, you can implement higher order operations, such as + synchronization primitives, group membership, ownership, etc.
Some + distributed applications have used it to: <emphasis>[tbd: add uses from + white paper and video presentation.]</emphasis> For more information, see + <emphasis>[tbd]</emphasis></para> + </section> + + <section> + <title>Performance</title> + + <para>ZooKeeper is designed to be highly performant. But is it? The + results of ZooKeeper's development team at Yahoo! Research indicate + that it is. (See <xref linkend="fg_zkPerfRW" />.) It is especially high + performance in applications where reads outnumber writes, since writes + involve synchronizing the state of all servers. (Reads outnumbering + writes is typically the case for a coordination service.)</para> + + <figure id="fg_zkPerfRW"> + <title>ZooKeeper Throughput as the Read-Write Ratio Varies</title> + + <mediaobject> + <imageobject> + <imagedata fileref="images/zkperfRW-3.2.jpg" /> + </imageobject> + </mediaobject> + </figure> + <para>The figure <xref linkend="fg_zkPerfRW"/> is a throughput + graph of ZooKeeper release 3.2 running on servers with dual 2GHz + Xeon processors and two SATA 15K RPM drives. One drive was used as a + dedicated ZooKeeper log device. The snapshots were written to + the OS drive. Write requests were 1K writes and the reads were + 1K reads. "Servers" indicate the size of the ZooKeeper + ensemble, the number of servers that make up the + service. Approximately 30 other servers were used to simulate + the clients. The ZooKeeper ensemble was configured such that + leaders do not allow connections from clients.</para> + + <note><para>In version 3.2 r/w performance improved by ~2x + compared to the <ulink + url="http://zookeeper.apache.org/docs/r3.1.1/zookeeperOver.html#Performance">previous + 3.1 release</ulink>.</para></note> + + <para>Benchmarks also indicate that it is reliable. <xref + linkend="fg_zkPerfReliability" /> shows how a deployment responds to + various failures. The events marked in the figure are the + following:</para> + + <orderedlist> + <listitem> + <para>Failure and recovery of a follower</para> + </listitem> + + <listitem> + <para>Failure and recovery of a different follower</para> + </listitem> + + <listitem> + <para>Failure of the leader</para> + </listitem> + + <listitem> + <para>Failure and recovery of two followers</para> + </listitem> + + <listitem> + <para>Failure of another leader</para> + </listitem> + </orderedlist> + </section> + + <section> + <title>Reliability</title> + + <para>To show the behavior of the system over time as + failures are injected we ran a ZooKeeper service made up of + 7 machines. We ran the same saturation benchmark as before, + but this time we kept the write percentage at a constant + 30%, which is a conservative ratio of our expected + workloads. + </para> + <figure id="fg_zkPerfReliability"> + <title>Reliability in the Presence of Errors</title> + <mediaobject> + <imageobject> + <imagedata fileref="images/zkperfreliability.jpg" /> + </imageobject> + </mediaobject> + </figure> + + <para>There are a few important observations from this graph. First, if + followers fail and recover quickly, then ZooKeeper is able to sustain a + high throughput despite the failure. Second, and maybe more importantly, the + leader election algorithm allows the system to recover fast enough + to prevent throughput from dropping substantially. In our observations, + ZooKeeper takes less than 200ms to elect a new leader.
Third, as + followers recover, ZooKeeper is able to raise throughput again once they + start processing requests.</para> + </section> + + <section> + <title>The ZooKeeper Project</title> + + <para>ZooKeeper has been + <ulink url="https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy"> + successfully used + </ulink> + in many industrial applications. It is used at Yahoo! as the + coordination and failure recovery service for Yahoo! Message + Broker, which is a highly scalable publish-subscribe system + managing thousands of topics for replication and data + delivery. It is used by the Fetching Service for Yahoo! + crawler, where it also manages failure recovery. A number of + Yahoo! advertising systems also use ZooKeeper to implement + reliable services. + </para> + + <para>All users and developers are encouraged to join the + community and contribute their expertise. See the + <ulink url="http://zookeeper.apache.org/"> + Zookeeper Project on Apache + </ulink> + for more information. + </para> + </section> + </section> +</article>
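As a companion to the "Simple API" section in the overview above, here is a minimal Java sketch that exercises most of the listed operations through the standard ZooKeeper client. It is an illustration under stated assumptions rather than recommended production code: the connect string localhost:2181, the /demo path, the payloads, and the class name SimpleApiDemo are placeholder values chosen for this example, and error handling and retries are omitted.

import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SimpleApiDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // The watcher given to the constructor receives session events;
        // here we only wait until the client is connected.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, (WatchedEvent event) -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // create: make a znode and give it some data
        zk.create("/demo", "hello".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists: check the node, leaving a watch that fires when it changes
        Stat stat = zk.exists("/demo", true);

        // get data / set data: read the bytes back, then overwrite them
        byte[] data = zk.getData("/demo", false, stat);
        System.out.println("read: " + new String(data));
        zk.setData("/demo", "world".getBytes(), stat.getVersion());

        // get children: list child znodes (none in this example)
        List<String> children = zk.getChildren("/demo", false);
        System.out.println("children: " + children);

        // delete: remove the node; -1 matches any version
        zk.delete("/demo", -1);

        zk.close();
    }
}

This sketch assumes a ZooKeeper server is reachable on localhost:2181, for example a standalone server started with bin/zkServer.sh as described in the JMX and observer documents above.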
