There are some basic elements that all atomic broadcast protocols have, but I don't think that makes them all the same.
There actually are two parts to the atomic broadcast protocol: the establishment of the leader, and the leading phase. A big difference between us and Paxos is that we never do phase 1. You are correct that the NEW_LEADER proposal is a key reason (but not the only reason) we can skip phase 1, but it isn't Paxos, and we never deliver any messages with that proposal. Multi-Paxos does something kind of similar, but it still has to deal with some phase 1 issues.

The basic protocol of ZooKeeper is certainly not Paxos. The logic for the basic protocol is mostly in the Leader and Follower classes. It is extremely simple:

Leader -Proposal-> Followers -ACK-> Leader -COMMIT-> Followers

Now you could say that the Propose/ACK phase is straight from Paxos, but it is also part of classic two-phase commit or pretty much any other such protocol. (To be honest, the inspiration came from two-phase commit. The challenge was to get around the recovery problems of two-phase commit. Of course we aren't two-phase commit either, since we do not have to worry about aborts, which made things simpler for us.)

Looking at ZooKeeper messaging will not give you insight into Paxos, unfortunately. We take advantage of FIFO channels (TCP), which allows us to assume ordering; we also make sure that there will never be two messages with the same zxid, which is why we do not need something like phase 1 of Paxos; we maintain strict ordering; and we use the well-known concept of epochs to ensure unique message ids. (The zxid is an <epoch, proposal> pair.)

We think it is a cool protocol, and it's something that is really quite intuitive and easy to explain. (We have been pulling a paper together on it. I'll put up a presentation we give on the ZooKeeper site.) A cursory glance may make you think Paxos, but it really isn't.

A similar thing happens with ZooKeeper and Chubby.
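To make the <epoch, proposal> pair concrete, here is a minimal sketch of how such a zxid can be packed into a single 64-bit value, assuming the epoch goes in the high 32 bits and the per-epoch proposal counter in the low 32 bits (the method names are illustrative, not ZooKeeper's actual API):

```java
// Sketch: a zxid as a 64-bit <epoch, proposal counter> pair.
// Epoch in the high 32 bits, counter in the low 32 bits, so a
// new epoch always compares greater than every zxid of an older
// epoch -- which is what gives the strict total order.
public class ZxidSketch {
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid)   { return zxid >>> 32; }
    static long counterOf(long zxid) { return zxid & 0xffffffffL; }

    public static void main(String[] args) {
        long zxid = makeZxid(5, 42);
        System.out.println("epoch=" + epochOf(zxid)
                + " counter=" + counterOf(zxid));
        // Any zxid from epoch 6 sorts after every zxid from epoch 5,
        // even the largest possible counter value.
        System.out.println(makeZxid(6, 0) > makeZxid(5, 0xffffffffL));
    }
}
```

Because the comparison is just a long comparison, "never two messages with the same zxid" plus FIFO channels gives the ordering guarantee without anything like Paxos phase 1.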
People seem to lump them together after a cursory glance, but ZooKeeper is not a lock service: requests never block each other, clients never block each other, and we don't even deliver notifications synchronously. Our reads are local fast reads. Even operationally, ZooKeeper clients connect to followers, whereas Chubby clients always connect to the master (this gets back to the fast-read thing). The biggest thing we have in common is the hierarchical namespace, but hey, I'm a filesystem guy; what other kind of namespace is there? :)

Put everything together, and ZooKeeper is used differently than Chubby. They may both be used for the same problem, but the resulting solutions will be different.

ben

----- Original Message ----
From: Evan Jones <[EMAIL PROTECTED]>
To: Benjamin Reed <[EMAIL PROTECTED]>
Cc: zookeeper-user@lists.sourceforge.net
Sent: Friday, June 6, 2008 8:33:29 PM
Subject: Re: [Zookeeper-user] Does ZooKeeper includes Paxos ?

On Jun 6, 2008, at 17:42 , Benjamin Reed wrote:
> No. Internally we use an atomic broadcast protocol (which is what
> Paxos would provide us) to keep the replicas in sync, but the
> protocol is much simpler than Paxos.

After reading QuorumPeer, I think it is safe to say that Zookeeper implements Paxos. The mapping between your code and Lamport's terminology:

Zookeeper -> Lamport (Paxos)
Notification (FastLeaderElection.java) -> Prepare request
NEWLEADER (Leader.java) -> Accept message
(zxid, leader id) -> "proposal number"

It seems more or less equivalent to me, and there are no obvious problems with it that I can find.

Unrelated observations from browsing the source:

1. syncLimit has slightly different comments in the class header and inline with the variable.

2. ResponderThread: it reads "state" from the enclosing class. This variable is not volatile. My weak understanding of the Java memory model is that without doing a "synchronizing operation," the JVM would be free to optimize away the access of this "state" variable.
That is, the ResponderThread could get "stuck" in an old state. I could be wrong, because this stuff is ridiculously hard. See the FAQ: http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile

The accesses to the currentVote volatile variable probably give you the memory barrier guarantees you need, except when LEADING. In that case, the ResponderThread never touches another volatile, so in theory the JVM could read state once and be happy with it forever after. Will it ever actually do that? I would be surprised. However, I think you would be safer making state volatile, so ResponderThread would *definitely* see a change immediately.

FastLeaderElection.java line 224: the part of the condition after && is not needed. This is the else branch of an if statement, where the condition is exactly the first part. Hence, the part after && *must* be true.

FastLeaderElection.java: I think it also has accesses of a set of variables shared between threads with no synchronization or volatiles, such as logicalclock.

Evan

--
Evan Jones
http://evanjones.ca/

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace. It's the best place to buy or sell services for just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
Zookeeper-user mailing list
Zookeeper-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/zookeeper-user
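The visibility hazard described above can be sketched in a few lines. This is a toy, not ZooKeeper's actual QuorumPeer/ResponderThread code: a thread spins on a flag written by another thread, and without `volatile` the JIT is permitted to hoist the read out of the loop so the spinner never sees the update.

```java
// Sketch of the Java memory model hazard: a responder-style thread
// spinning on a shared field. Declaring the field volatile guarantees
// the spinning thread observes the write; without volatile, the loop
// may legally spin forever on a stale cached value.
public class VisibilitySketch {
    // The volatile keyword is the whole point of this sketch.
    private volatile boolean running = true;

    // Returns true if the worker thread terminated after stop().
    boolean runWorker() throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (running) { /* e.g. respond to election packets */ }
        });
        worker.start();
        Thread.sleep(100);
        running = false;     // volatile write: guaranteed visible to worker
        worker.join(1000);   // worker exits promptly thanks to volatile
        return !worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(new VisibilitySketch().runWorker());
    }
}
```

With `volatile` removed, this program may or may not hang depending on the JIT, which matches Evan's point: it will probably work in practice, but making the field volatile makes it definitely correct.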