Over the phone Jeff asked me to start a discussion about the totem protocol, so here it is.
If anyone just wants to get it from the horse's mouth you can read this paper: "The Totem Single-Ring Ordering and Membership Protocol", Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, and P. Ciarfella, ACM Transactions on Computer Systems 13, 4 (November 1995), 311-342. It's on the web.

When implementing replication using multicasting, you need the following:

1. The ability to discern which 'processors' or 'nodes' are in "the group".
2. The ability to re-send dropped multicast packets.
3. Flow control.
4. The ability to compute at least a partial ordering on all messages.
5. The ability to avoid a split-brain cluster.
6. State transfer.

Number 5 is not part of the totem protocol, so I am not going to address it right now. You can google "quorum resource" or read Pfister's book "In Search of Clusters". Number 6 is not really tied to reliable multicasting, so I am not going to address it here either, although it is an interesting issue.

The membership protocol (1) is not a major challenge (although from the process-group theoretical viewpoint it may be the most interesting). Totem has its own membership protocol. When you start up the totem protocol it detects the presence of other processors using join messages, which is like saying "I am here, and these are the other processors I have detected so far". You can read about the details in the totem paper above. The membership protocol also detects failures, and totem has a recovery state to 'flush' the messages from the last ring. Again, see the article.

Imagine that you now have a 'ring' of processors numbered 1 through n. Processor number 1 can send some multicast packets, with ids 1, 2, etc. When it has nothing more to send, it sends a udp packet to processor 2. This packet is called the 'token'. In my implementation (EVS4J) the token is also multicast, to keep it simple. The token contains the id of the last message sent. Processor 2 may have missed any number of messages, including the token itself. Retransmission of the token is handled using a timeout (see the article). Once processor 2 has the token, it looks at the last id sent and at its own buffer of received messages, and it detects any gaps. If there are any, it lists the ids of the missed messages in the token before it sends it on. It also sends its own new messages before sending the token on. When a processor receives the token and sees that some other processor missed messages, it resends whatever it can. That's how totem adds reliability to multicasting.

Ordering is trivial. Since each message has a positive integer as an id, totem does not deliver message N to the application unless it has already delivered message N-1.

Flow control is actually totem's strongest point. On a LAN, when a message is dropped it's usually because of buffer overflow at the receiver. Therefore throttling is the critical factor that needs to be addressed in order to reach the maximum theoretical throughput of the group as a whole. Totem implements flow control using a window which sets the maximum number of messages that may be sent during the current token rotation. It works pretty well. You can read the details in the article above.

EVS4J has an extra feature which is not in the totem article, namely congestion control, aka window tuning. I used to test totem on my LAN, and telnetting to the various servers was a nightmare because the network cards were all packed with totem messages. I knew the problem had been solved in TCP by the Van Jacobson et al. algorithm, so I adapted that algorithm to the totem window.
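To make the token mechanics a bit more concrete, here is a rough Java sketch of what a processor might do when the token arrives. The names (TokenSketch, ProcessorSketch, and so on) are made up for this post, and the real token carries more fields (see the paper); this only shows the gap-detection and retransmission idea.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Hypothetical sketch of the token handling described above.
    // The class and field names are invented; see the Totem paper for the real protocol.
    class TokenSketch {
        long lastIdSent;                                    // highest message id sent on the ring so far
        List<Long> retransmitRequests = new ArrayList<>();  // ids some processor reported missing
    }

    class ProcessorSketch {
        final Map<Long, byte[]> received = new TreeMap<>(); // messages we have seen, by id
        long nextIdToSend = 1;

        // Called when the token reaches this processor.
        void onToken(TokenSketch token) {
            // 1. Resend anything another processor reported missing, if we have it.
            for (Long id : token.retransmitRequests) {
                byte[] msg = received.get(id);
                if (msg != null) {
                    multicast(id, msg);
                }
            }
            token.retransmitRequests.clear();

            // 2. Compare the highest id sent on the ring with our own buffer and
            //    list any gaps in the token so another processor can resend them.
            for (long id = 1; id <= token.lastIdSent; id++) {
                if (!received.containsKey(id)) {
                    token.retransmitRequests.add(id);
                }
            }

            // 3. Send our own new messages here, subject to the window
            //    (see the flow-control sketch further down), bumping nextIdToSend.

            // 4. Record the new high-water mark and pass the token on.
            token.lastIdSent = Math.max(token.lastIdSent, nextIdToSend - 1);
            forwardToken(token);
        }

        void multicast(long id, byte[] payload) { /* UDP multicast to the group */ }
        void forwardToken(TokenSketch token)    { /* send the token to the next processor */ }
    }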
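The ordering rule is just as easy to sketch: buffer what arrives and deliver only the contiguous prefix. Again, this is an illustration, not EVS4J code.

    import java.util.Map;
    import java.util.TreeMap;

    // Hypothetical sketch of "deliver message N only after N-1 has been delivered".
    class DeliverySketch {
        private final Map<Long, byte[]> undelivered = new TreeMap<>();
        private long nextToDeliver = 1;

        // Called whenever a message arrives (original or retransmitted).
        void onMessage(long id, byte[] payload) {
            undelivered.put(id, payload);
            // Deliver as far as the contiguous prefix allows; a gap stops delivery
            // until the missing message shows up via the token retransmission mechanism.
            byte[] next;
            while ((next = undelivered.remove(nextToDeliver)) != null) {
                deliverToApplication(nextToDeliver, next);
                nextToDeliver++;
            }
        }

        void deliverToApplication(long id, byte[] payload) { /* hand the message to the app */ }
    }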
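And this is roughly what I mean by the window and by window tuning. The increase/decrease rule below is just my shorthand for "in the spirit of Van Jacobson"; the constant and the loss signal are invented for the sketch, so don't read it as the actual EVS4J logic.

    // Hypothetical sketch of the per-rotation window plus Van Jacobson style tuning.
    // MAX_WINDOW and the halving/incrementing rule are illustrative values only.
    class WindowSketch {
        static final int MAX_WINDOW = 60; // the tuning parameter: max messages per token rotation
        private int window = 1;           // current allowance, free to oscillate between 0 and MAX_WINDOW

        // How many new messages this processor may send on this token visit.
        int allowanceForThisRotation(int queuedMessages) {
            return Math.min(window, queuedMessages);
        }

        // Called once per token rotation with what was observed on the ring.
        void onRotationComplete(boolean lossDetected) {
            if (lossDetected) {
                // A drop usually means a receiver's buffers overflowed: back off hard,
                // possibly all the way to zero (just passing the token around).
                window = window / 2;
            } else {
                // No loss: probe for more bandwidth, one message at a time.
                window = Math.min(MAX_WINDOW, window + 1);
            }
        }
    }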
Now if you run the totem benchmark and try to transfer a big file on the same LAN, the window backs off. So I still use a _maximum_ window size, but when there is extraneous load on the LAN the window is free to oscillate between zero (just sending the token around) and full throttle.

As far as performance goes, I believe that on a regular pc and a fast ethernet _hub_, EVS4J will do at least 6000 messages per second (with 1500-byte messages) with an average latency of a few ms for a small ring. For a larger ring the latency can get out of hand, and indeed there is a whole other protocol, the totem "multiple-ring" protocol, which uses several rings with gateways between them. I didn't implement that because testing it would have been too challenging for me.

The thing I like about totem is that its throughput and latency are predictable. The maximum window size is the tuning parameter: the bigger the window, the greater the throughput (until you reach the maximum) but also the greater the latency. So it's ideal for systems in a data center with high data-sharing requirements, whereas for loosely coupled clusters you can do better with a non-token-passing protocol. The problem with those is that flow control itself requires communication among the nodes, so those protocols probably don't get close to the maximum theoretical throughput. This may be true even for protocols which offer no ordering guarantees beyond "fifo".

I think this is probably enough information to start a discussion if anyone is interested.

Guglielmo