There are some basic elements that all atomic broadcast protocols have, but I 
don't think that makes them all the same.

There actually are two parts to the atomic broadcast protocol. The 
establishment of the leader, and the leading phase.

A big difference between us and Paxos is that we never do phase1. You are 
correct that the NEW_LEADER proposal is a key reason (but not the only reason) 
we can skip phase1, but it isn't Paxos and we never deliver any messages with 
that proposal. Multi-Paxos does something kind of similar, but they still have 
to deal with some phase1 issues.

The basic protocol of ZooKeeper is certainly not Paxos. The logic for the basic 
protocol is mostly in the Leader and Follower classes. It is extremely simple:

Leader -Proposal-> Followers -ACK-> Leader -COMMIT-> Followers

Now you could say that the Propose/ACK phase is straight from Paxos, but is is 
also part of classic 2 phase commit or pretty much any other such protocol. (To 
be honest the inspiration came from 2 phase commit. The challenge was to get 
around the recovery problems of 2 phase commit. Of course we aren't 2 phase 
commit either since we do not have to worry about aborts, which made things 
simpler for us.)

Looking at ZooKeeper messaging will not give you insight into Paxos 
unfortunately. We take advantage of FIFO channels (TCP), which allows us to 
assume ordering; we also make sure that there will never be a two messages with 
the same zxid, which is why we do not need something like phase1 of Paxos; we 
maintain strict ordering; we also use the well known concept of epochs to 
ensure unique messages ids. (The zxid is a <epoch, proposal> pair.)

We think it is a cool protocol, and it's something that is really quite 
intuitive and easy to explain. (We have been pulling a paper together on it. 
I'll put up a presentation we give on the ZooKeeper site.) A cursory glance may 
make you think Paxos, but it really isn't. A similar thing happens with 
ZooKeeper and Chubby. People seem to lump them together after a cursory glance, 
but ZooKeeper is not a lock service, requests never block each other, clients 
never block each other, we don't even deliver notifications synchronously; our 
reads are local fast reads; even operationally, ZooKeeper clients connect to 
followers whereas Chubby clients always connect to the master (this gets back 
to the fast read thing). The biggest thing we have in common is the hierarchal 
namespace, but hey, I'm a filesystem guy; what other kind of namespace is 
there? :) Put everything together, ZooKeeper is used differently than Chubby. 
They may both be used for the same problem,
 but the resulting solutions will be different.

ben

----- Original Message ----
From: Evan Jones <[EMAIL PROTECTED]>
To: Benjamin Reed <[EMAIL PROTECTED]>
Cc: zookeeper-user@lists.sourceforge.net
Sent: Friday, June 6, 2008 8:33:29 PM
Subject: Re: [Zookeeper-user] Does ZooKeeper includes Paxos ?

On Jun 6, 2008, at 17:42 , Benjamin Reed wrote:
> No. Internally we use an atomic broadcast protocol (which is what  
> Paxos would provide us) to keep the replicas in sync, but the  
> protocol is much simpler than Paxos.

After reading QuorumPeer, I think it is safe to say that Zookeeper  
implements Paxos. The mapping between your code and Lamport's  
terminology:

Zookeeper -> Lamport (Paxos)
Notification (FastLeaderElection.java) -> Prepare request
NEWLEADER (Leader.java) -> Accept message
(zxid, leader id) -> "proposal number"


It seems more or less equivalent to me, and there are no obvious  
problems with it that I can find.


Unrelated observations from browsing the source:

1. syncLimit has slightly different comments in the class header, and  
inline with the variable.

2. ResponderThread: It reads "state" from the enclosing class. This  
variable is not volatile. My weak understanding of the Java memory  
model is that without doing a "synchronizing operation," the JVM would  
be free to optimize away the access of this "state" variable. That is,  
the ResponderThread could get "stuck" in an old state. I could be  
wrong because this stuff is ridiculously hard. See the FAQ:

http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile

The accesses to the currentVote volatile variable probably gives you  
the memory barrier guarantees you need, except when LEADING. In that  
case, the ResponderThread never touches another volatile, so in theory  
the JVM could read state once and by happy with it forever after. Will  
it ever actually do that? I would be surprised. However, I think you  
would be safer making state volatile, so ResponderThread would  
*definitely* see a change immediately.


FastLeaderElection.java line 224: The part of the condition after &&  
is not needed: This is the else branch of an if statement, where the  
condition is exactly the first part. Hence, the part after && *must*  
be true.

FastLeaderElection.java: I think it also has accesses of a set of  
variables between threads with no synchronization or volatiles, such  
as logicalclock.


Evan

--
Evan Jones
http://evanjones.ca/


-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
Zookeeper-user mailing list
Zookeeper-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/zookeeper-user


-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
Zookeeper-user mailing list
Zookeeper-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/zookeeper-user

Reply via email to