This is an automated email from the ASF dual-hosted git repository.
hanm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/zookeeper.git
The following commit(s) were added to refs/heads/master by this push:
new 650aea3 ZOOKEEPER-3522: Consistency guarantees discussion.
650aea3 is described below
commit 650aea33eddef9054c90ca47183cde0e58cabcd9
Author: Karolos Antoniadis <[email protected]>
AuthorDate: Wed Aug 28 14:03:12 2019 -0700
ZOOKEEPER-3522: Consistency guarantees discussion.
There seems to be some confusion regarding the exact consistency guarantees
that ZooKeeper provides. For example, does it provide linearizable reads?
After the recent discussion in the [dev mailing
list](https://mail-archives.apache.org/mod_mbox/zookeeper-dev/201908.mbox/<CAO%3DK-y1r4FYEtDsQyeVc5poPw6EnzVbMDzTgMv9tcrggMX8AbQ%40mail.gmail.com>),
I tried to clarify this in the ZooKeeper documentation.
Author: Karolos Antoniadis <[email protected]>
Reviewers: Alexander Shraer <[email protected]>, maoling
<[email protected]>, Michael Han <[email protected]>
Closes #1063 from insumity/ZOOKEEPER-3522
---
.../main/resources/markdown/zookeeperInternals.md | 86 +++++++++++++---------
1 file changed, 52 insertions(+), 34 deletions(-)
diff --git a/zookeeper-docs/src/main/resources/markdown/zookeeperInternals.md
b/zookeeper-docs/src/main/resources/markdown/zookeeperInternals.md
index a3a5673..25d9be8 100644
--- a/zookeeper-docs/src/main/resources/markdown/zookeeperInternals.md
+++ b/zookeeper-docs/src/main/resources/markdown/zookeeperInternals.md
@@ -23,6 +23,7 @@ limitations under the License.
* [Active Messaging](#sc_activeMessaging)
* [Summary](#sc_summary)
* [Comparisons](#sc_comparisons)
+* [Consistency Guarantees](#sc_consistency)
* [Quorums](#sc_quorum)
* [Logging](#sc_logging)
* [Developer Guidelines](#sc_developerGuidelines)
@@ -34,9 +35,11 @@ limitations under the License.
## Introduction
This document contains information on the inner workings of ZooKeeper.
-So far, it discusses these topics:
+It discusses the following topics:
* [Atomic Broadcast](#sc_atomicBroadcast)
+* [Consistency Guarantees](#sc_consistency)
+* [Quorums](#sc_quorum)
* [Logging](#sc_logging)
<a name="sc_atomicBroadcast"></a>
@@ -52,18 +55,17 @@ At the heart of ZooKeeper is an atomic messaging system
that keeps all of the se
The specific guarantees provided by the messaging system used by ZooKeeper are
the following:
* *_Reliable delivery_* :
- If a message, m, is delivered
- by one server, it will be eventually delivered by all servers.
+ If a message `m`, is delivered
+ by one server, message `m` will be eventually delivered by all servers.
* *_Total order_* :
- If a message is
- delivered before message b by one server, a will be delivered before b by
all
- servers. If a and b are delivered messages, either a will be delivered
before b
- or b will be delivered before a.
+ If a message `a` is
+ delivered before message `b` by one server, message `a` will be delivered
before `b` by all
+ servers.
* *_Causal order_* :
- If a message b is sent after a message a has been delivered by the sender
of b,
- a must be ordered before b. If a sender sends c after sending b, c must be
ordered after b.
+ If a message `b` is sent after a message `a` has been delivered by the
sender of `b`,
+ message `a` must be ordered before `b`. If a sender sends `c` after
sending `b`, `c` must be ordered after `b`.
The ZooKeeper messaging system also needs to be efficient, reliable, and easy
to
implement and maintain. We make heavy use of messaging, so we need the system
to
@@ -80,29 +82,29 @@ lose or reorder messages, our assumption of FIFO channels
is very practical
given that we use TCP for communication. Specifically we rely on the following
property of TCP:
* *_Ordered delivery_* :
- Data is delivered in the same order it is sent and a message m is
- delivered only after all messages sent before m have been delivered.
- (The corollary to this is that if message m is lost all messages after m
will be lost.)
+ Data is delivered in the same order it is sent and a message `m` is
+ delivered only after all messages sent before `m` have been delivered.
+ (The corollary to this is that if message `m` is lost all messages after
`m` will be lost.)
* *_No message after close_* :
Once a FIFO channel is closed, no messages will be received from it.
FLP proved that consensus cannot be achieved in asynchronous distributed
systems
-if failures are possible. To ensure we achieve consensus in the presence of
failures
-we use timeouts. However, we rely on times for liveness not for correctness.
So,
-if timeouts stop working (clocks malfunction for example) the messaging system
may
+if failures are possible. To ensure that we achieve consensus in the presence
of failures
+we use timeouts. However, we rely on time for liveness not for correctness. So,
+if timeouts stop working (e.g., skewed clocks) the messaging system may
hang, but it will not violate its guarantees.
When describing the ZooKeeper messaging protocol we will talk of packets,
proposals, and messages:
* *_Packet_* :
- a sequence of bytes sent through a FIFO channel
+ a sequence of bytes sent through a FIFO channel.
* *_Proposal_* :
a unit of agreement. Proposals are agreed upon by exchanging packets
with a quorum of ZooKeeper servers. Most proposals contain messages,
however the
- NEW_LEADER proposal is an example of a proposal that does not correspond
to a message.
+ NEW_LEADER proposal is an example of a proposal that does not contain to a
message.
* *_Message_* :
a sequence of bytes to be atomically broadcast to all ZooKeeper
@@ -121,7 +123,7 @@ n is the number of servers that make up a ZooKeeper service.
The zxid has two parts: the epoch and a counter. In our implementation the zxid
is a 64-bit number. We use the high order 32-bits for the epoch and the low
order
-32-bits for the counter. Because it has two parts represent the zxid both as a
+32-bits for the counter. Because zxid consists of two parts, zxid can be
represented both as a
number and as a pair of integers, (_epoch, count_). The epoch number
represents a
change in leadership. Each time a new leader comes into power it will have its
own epoch number. We have a simple algorithm to assign a unique zxid to a
proposal:
@@ -146,18 +148,15 @@ up with the leader, they have the same state. This state
consists of all of the
proposals that the leader believes have been committed and the proposal to
follow
the leader, the NEW_LEADER proposal. (Hopefully you are thinking to
yourself, _Does the set of proposals that the leader believes has been
committed
-included all the proposals that really have been committed?_ The answer is
_yes_.
+include all the proposals that really have been committed?_ The answer is
_yes_.
Below, we make clear why.)
<a name="sc_leaderElection"></a>
### Leader Activation
-Leader activation includes leader election. We currently have two leader
election
-algorithms in ZooKeeper: LeaderElection and FastLeaderElection
(AuthFastLeaderElection
-is a variant of FastLeaderElection that uses UDP and allows servers to perform
a simple
-form of authentication to avoid IP spoofing). ZooKeeper messaging doesn't care
about the
-exact method of electing a leader as long as the following holds:
+Leader activation includes leader election (`FastLeaderElection`).
+ZooKeeper messaging doesn't care about the exact method of electing a leader
as long as the following holds:
* The leader has seen the highest zxid of all the followers.
* A quorum of servers have committed to following the leader.
@@ -170,18 +169,18 @@ we will recover by abandoning leader activation and
running another election.
After leader election a single server will be designated as a leader and start
waiting for followers to connect. The rest of the servers will try to connect
to
-the leader. The leader will sync up with followers by sending any proposals
they
+the leader. The leader will sync up with the followers by sending any
proposals they
are missing, or if a follower is missing too many proposals, it will send a
full
snapshot of the state to the follower.
-There is a corner case in which a follower that has proposals, U, not seen
-by a leader arrives. Proposals are seen in order, so the proposals of U will
have a zxids
+There is a corner case in which a follower that has proposals, `U`, not seen
+by a leader arrives. Proposals are seen in order, so the proposals of `U` will
have a zxids
higher than zxids seen by the leader. The follower must have arrived after the
leader election, otherwise the follower would have been elected leader given
that
it has seen a higher zxid. Since committed proposals must be seen by a quorum
of
-servers, and a quorum of servers that elected the leader did not see U, the
proposals
-of you have not been committed, so they can be discarded. When the follower
connects
-to the leader, the leader will tell the follower to discard U.
+servers, and a quorum of servers that elected the leader did not see `U`, the
proposals
+of `U` have not been committed, so they can be discarded. When the follower
connects
+to the leader, the leader will tell the follower to discard `U`.
A new leader establishes a zxid to start using for new proposals by getting the
epoch, e, of the highest zxid it has seen and setting the next zxid to use to
be
@@ -226,9 +225,9 @@ the following operating constraints are observed:
received. Because we use FIFO channels this means that followers also
receive proposals in order.
* Followers process messages in the order they are received. This
means that messages will be ACKed in order and the leader will receive ACKs
from
- followers in order, due to the FIFO channels. It also means that if message
$m$
+ followers in order, due to the FIFO channels. It also means that if message
`m`
has been written to non-volatile storage, all messages that were proposed
before
- $m$ have been written to non-volatile storage.
+ `m` have been written to non-volatile storage.
* The leader will issue a COMMIT to all followers as soon as a
quorum of followers have ACKed a message. Since messages are ACKed in order,
COMMITs will be sent by the leader as received by the followers in order.
@@ -267,6 +266,26 @@ all packets, it all falls apart. Also, our leader
activation phase is different
both of them. In particular, our use of epochs allows us to skip blocks of
uncommitted
proposals and to not worry about duplicate proposals for a given zxid.
+<a name="sc_consistency"></a>
+
+
+## Consistency Guarantees
+
+ZooKeeper [consistency](https://jepsen.io/consistency) guarantees lie between
sequential consistency and linearizabiliy. Here, we explain the exact
consistency guarantees that ZooKepeer provides.
+
+Write operations in ZooKeeper are linearizabile. In other words, each write
appears to take effect atomically at some point between its invocation and its
response. This means that the writes performed by all the clients in ZooKeeper
can be totally ordered in such a way that respects the real-time ordering of
these writes. However, note that just stating that writes are linearizable is
meaningless unless we also talk about read operations.
+
+Read operations in ZooKeeper are not linearizable since they can return
potentially stale data. This occurs since a read in ZooKeeper is not a quorum
operation and a server responds immediately to a client that is performing a
read.
+Nevertheless, ZooKeeper makes this choice because it chooses performance in
the trade-off between performance and consistency. ZooKeeper read operations
are sequentially-consistent, since read operations appear to take effect in
some sequential order that furthermore respects the order of each client's
operations.
+If a client wants to read the freshest data, it is generally assumed that the
client should first perform a sync operation, and then a read.
+However, even with a sync before a read operation, a client might retrieve
stale data.
+This can occur because `sync` is [not a quorum
operation](https://issues.apache.org/jira/browse/ZOOKEEPER-1675). Such a
scenario might appear if two servers think that they are the leaders at the
same time, which may occur if the time it takes for a TCP connection to drop is
smaller than `syncLimit * tickTime`, something that is
[unlikely](https://www.amazon.com/ZooKeeper-Distributed-Coordination-Flavio-Junqueira/dp/1449361307)
to occur in practice.
+
+
+This raises the question on what are the exact consistency guarantees of
ZooKeeper?
+Formally, the ZooKeeper consistency guarantees are captured by the notion of
[ordered sequential
consistency](http://webee.technion.ac.il/people/idish/ftp/OSC-IPL17.pdf) or
`OSC(U)` to be exact, that lies between sequential consistency and
linearizability.
+Finally, note that the current version of ZooKeeper can provide
linearizability for both reads and writes, if every read is preceded by a write
to some dummy znode.
+
<a name="sc_quorum"></a>
## Quorums
@@ -313,8 +332,7 @@ of the [ZooKeeper Administrator's
Guide.](zookeeperAdmin.html)
### Developer Guidelines
Please follow the [slf4j manual](http://www.slf4j.org/manual.html) when
creating log statements within code.
-Also read the[FAQ on
performance](http://www.slf4j.org/faq.html#logging\_performance)
-, when creating log statements. Patch reviewers will look for the following:
+Also read the [FAQ on
performance](http://www.slf4j.org/faq.html#logging\_performance), when creating
log statements. Patch reviewers will look for the following:
<a name="sc_rightLevel"></a>