This is a pretty academic debate, but I'm sorry: I'm a graduate student, I can't resist. Also: In case it isn't clear, I don't intend this as a criticism of Zookeeper: I think it is a great, and am likely to use it for a research project I am working on at the moment. I just want to make sure I understand its implementation.
Summary: I am still unconvinced. I still think the Zookeeper protocol is basically Paxos. On Jun 7, 2008, at 2:33 , Benjamin Reed wrote: > There are some basic elements that all atomic broadcast protocols > have, but I don't think that makes them all the same. Fair enough: I have had this same debate with others, so perhaps I hold a minority opinion: All the consensus protocols tend to have the same 2 rounds in order to reach a decision, in the failure recovery case, and 1 round in the failure free case. They also tend to have the same need for sequence numbers in the messages, and voting, etc. While some of the details differ, the gist of all them seems pretty similar to me. Hence, I consider all of them roughly equivalent. Part of the confusion may be that "Paxos" as defined by Lamport isn't really useful by itself: it is a protocol for deciding a single value. You need to add a bunch of tweaks to the base algorithm to produce a useful state machine replication system. These tweaks can all be done in slightly different ways. > A big difference between us and Paxos is that we never do phase1. By Phase 1 you mean the prepare/ack round in Paxos? Except the Zookeeper protocol *does* do this: They are equivalent to the "notification" messages sent by the Leader election algorithm. The notification messages and the acknowledgements contain the same information as would be exchanged by Paxos phase 1 messages. > Now you could say that the Propose/ACK phase is straight from Paxos, > but is is also part of classic 2 phase commit or pretty much any > other such protocol. Right: Any distributed consensus protocol will use voting like this. As you say, a Multi-Paxos implementation will have the exact same number of messages as Zookeeper does, with almost exactly the same message contents. > Looking at ZooKeeper messaging will not give you insight into Paxos > unfortunately. We take advantage of FIFO channels (TCP), which > allows us to assume ordering; How do you handle the cases where a TCP connection breaks and is then re-established? This can violate the FIFO assumption. Is there code to handle this case that I can look at? > we also make sure that there will never be a two messages with the > same zxid, which is why we do not need something like phase1 of Paxos; Paxos makes the same assumption: there cannot be two messages with the same proposal number. > We think it is a cool protocol, and it's something that is really > quite intuitive and easy to explain. (We have been pulling a paper > together on it. I'll put up a presentation we give on the ZooKeeper > site.) A cursory glance may make you think Paxos, but it really isn't. I look forward to reading a description of the protocol. Reverse engineering the protocol from the source code may not be the most reliable way to figure it out. I *think* I know how Zookeeper's algorithm works, but I could easily be mistaken in some edge cases. Evan -- Evan Jones http://evanjones.ca/ ------------------------------------------------------------------------- Check out the new SourceForge.net Marketplace. It's the best place to buy or sell services for just about anything Open Source. http://sourceforge.net/services/buy/index.php _______________________________________________ Zookeeper-user mailing list Zookeeper-user@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/zookeeper-user