[Hadoop Wiki] Update of "ZooKeeper/Observers" by HenryRobinson

Apache Wiki Wed, 15 Jul 2009 05:17:40 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The following page has been changed by HenryRobinson:
http://wiki.apache.org/hadoop/ZooKeeper/Observers

New page:
= Observers =

== Proposal ==

Observers are a type of participant in a ZooKeeper ensemble that do not take 
part in the underlying atomic broadcast protocol that ZK is built upon. In 
particular, they do not vote on new proposals and can not become Leaders. 

Observers are informed of all committed proposals in zxid order. Observers may 
issue proposals, and will be told about the outcome. 

== Use Cases ==

The main advantage of having Observers is to remove some of the burden of 
serving clients from ensemble nodes which are more concerned with correctly 
executing instances of ZAB in a timely fashion. Observers are not a critical 
component of a ZooKeeper ensemble, and therefore do not have to meet strict 
timing guarantees. They can therefore be more heavily loaded without fear of 
jeopardising the liveness of the ensemble.

This design provides read-scalability; i.e. the ability to add more read-only 
clients without compromising write performance. Because Observers do not vote 
or participate in leader elections the only impact of a careful implementation 
on ensemble performance should be the bookkeeping required to keep track of 
them at the Leader, plus the messaging costs associated with informing them of 
committed proposals.

In the current design, where clients can only connect to Followers, attaching 
too many clients can slow down a Follower or even crash it. The behaviour of 
clients is such that, like a swarm of insects, they will move to the next 
Follower which may well suffer the same fate. The performance of the cluster, 
and eventually its liveness, would be compromised. Observers can insulate the 
core ensemble against these problems. 

 * As a datacenter bridge: Forming a ZK ensemble between two datacenters is a 
problematic endeavour as the high variance in latency between the datacenters 
could lead to false positive failure detection and partitioning. However if the 
ensemble runs entirely in one datacenter, and the second datacenter runs only 
Observers, partitions aren't problematic as the ensemble remains connected. 
Clients of the Observers may still see and issue proposals.

 * As a link to a message bus: Some companies have expressed an interest in 
using ZK as a component of a persistent reliable message bus. Observers would 
give a natural integration point for this work: a plug-in mechanism could be 
used to attach the stream of proposals an Observer sees to a publish-subscribe 
system, again without loading the core ensemble. 

 * To support dynamic membership: once dynamic membership is enabled it will be 
common for nodes that intend to become Followers to connect to the ensemble in 
order to be ready to be informed of the membership change that admits them. 
Observers naturally capture the state of being connected but not being part of 
the voting ensemble.

== Proposed Design ==

An Observer is started much like any other ensemble peer. They find and connect 
to the Leader through the same mechanism as Followers. Instead of sending a 
FOLLOWERINFO packet, they send an OBSERVERINFO packet which tells the Leader 
that this is an Observer. The Leader from that point on can distinguish the 
Observer and doesn't send it any Follower-specific messages.

The Leader and the Observer sync up through the usual mechanism. The Leader 
then only sends INFORM messages to the Observer which are sent when a proposal 
has received enough votes to be committed. If the Leader receives an ACK from 
an Observer, or if the Observer receives a PROPOSAL from a Leader, it's an 
error.

Clients may connect to the Observer as if it were a Follower, and issue 
proposals, set watches and so on. The Observer forwards REQUEST packets to the 
Leader, and commits the resulting proposals once INFORM is received. Observers 
make the same guarantees about ordering of client proposals that Followers do. 
Observers will eventually see the whole proposal log due to syncing with the 
Leader upon every connection attempt. 

The fact that INFORM is the only message sent to an Observer for a proposal 
means that the message cost of Observers is one message per Observer per 
proposal, compared to three per Follower. However, this means that INFORM must 
contain enough information to commit each proposal (unlike COMMIT messages 
which just include a zxid and are matched against the already seen PROPOSAL); 
so the saving is essentially the ACK / COMMIT message pair. 

=== Backwards Compatibility ===

It is very important not to require a complete cluster restart for these 
changes, and to maintain backwards compatibility with existing data. 

There are two cases to consider:

 1. An Observer tries to connect to a pre-Observer cluster: The Observer will 
succeed in connecting, but once it sends an OBSERVERINFO packet the Leader will 
respond that it does not understand the packet type and close the connection.

 2. A pre-Observer Follower tries to connect to an Observer-aware cluster: The 
behaviour of Followers has not been changed. They still receive the same set of 
messages, and connect via the same protocols, as before. Therefore they will be 
able to successfully connect to an upgraded cluster. 

The data format has not changed, so logs will be backwards compatible. 
Downgrading will also be possible to a pre-Observer version.

=== Security ===

Current security in ZK is achieved in two ways: ACLs on individual znodes that 
enforce policies at the client, and a whitelist of ensemble nodes in the 
configuration file. 

Since Observers currently can connect from any source address, this removes all 
security from the cluster. We must, at least, implement a whitelist of IP 
ranges from which an Observer can connect. At this point security is the same 
as the current version - to attack a ZK cluster we must write a compromised 
Follower and run it from one of the IP addresses in the configuration file. 

We should look into authenticated connections, and if we implement the 
re-publish use case must take sure to only re-publish proposals which are 
authorised by an ACL provided to the Observer. 

=== Testing ===

We must test the claims made about backwards compatibility. Ideally we would 
have a running cluster and a suite of tests that can be run against it as we do 
a rolling upgrade of the ensemble nodes. This would be a helpful general tool 
for any major release.

Observers should be subject to the same test suite that Followers are 
(including stress tests etc).

We should also test the basic properties of Observers: never receiving a 
proposal, not voting to elect a leader (can maybe do this by constructing a 
situation where a leader would be elected iff the Observer voted), seeing 
proposals in zxid order etc.

=== Performance ===

The ensemble performance should remain reasonably unchanged. There are two 
areas where extra load is placed on the system:

 1. The Leader must keep track of all Observers. The cost is the same as for 
each Follower - in fact in the current implementation FollowerHandlers are used 
to handle Observer connections.

 2. Observers must receive every proposal via a single INFORM message, adding N 
messages to every proposal for N Observers. 

Regression testing of existing benchmarks will help validate these claims. 

== Proposed Implementation ==

Observers duplicate a great deal of functionality from Followers. Therefore I 
have introduced a Peer base class that contains common code. There is an 
accompanying PeerZooKeeperServer class from which ObserverZooKeeperServer and 
FollowerZooKeeperServer inherit. 

Observers currently have the same RequestProcessor pipeline as Followers. This 
might not be the case in the future. 

Some of the code in the current patch is preparation for the dynamic membership 
patch which is built upon this one. (Some of these changes need erasing for the 
final version of this patch). There is PeerType enum which describes whether a 
Peer is a PARTICIPANT (i.e. will follow if able to) or if it is an OBSERVER. 
Based on that the LeaderElection process knows which state to move the Peer 
into once it has found a leader; from LOOKING to either FOLLOWING or OBSERVING.

Setting peerType=observer in a server's configuration file will ensure that its 
PeerType=OBSERVER. Otherwise it defaults to PARTICIPANT. 

== What's left to do? ==

 1. Different timeout for Observers - can be much more lax about timeouts.
 2. Evaluate impact on JMX
 3. Update four-letter commands
 4. Final version of patch that cleans up rough edges, provides documentation 
and test cases.
 5. ...

[Hadoop Wiki] Update of "ZooKeeper/Observers" by HenryRobinson

Reply via email to