Over the phone Jeff asked me to start a discussion about the totem
protocol, so here it is.

If anyone just wants to get it from the horse's mouth you can read this
paper:

"The Totem Single-Ring Ordering and Membership Protocol",
Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, and P. Ciarfella,
ACM Transactions on Computer Systems 13, 4 (November 1995), 311-342.

It's on the web.

When implementing replication using multicasting, you need the following:

1. The ability to discern which 'processors' or 'nodes' are in "the group".

2. The ability to re-send dropped multicast packets.

3. Flow control.

4. The ability to compute at least a partial ordering on all messages.

5. The ability to avoid a split-brain cluster.

6. State transfer.

Number 5 is not part of the totem protocol so I am not going to address it
right now. You can google "quorum resource" or read Pfister's book "In
Search of Clusters".

Number 6 is not really tied to reliable multicasting, so I am not going to
address it here. It's an interesting issue, though.

The membership protocol (1) is not a major challenge (although from the
process-group theoretical viewpoint it may be the most interesting). Totem
has its own membership protocol. When you start up the totem protocol it
detects the presence of other processors using join messages, which are
like saying "I am here, and these are the other processors I have detected
so far". You can read about the details in the totem paper above. The
membership protocol also detects failures, and totem has a recovery state
to 'flush' the messages from the last ring. Again, see the article.
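
To make the shape of a join message concrete, here is a minimal sketch in
Java (evs4j is a Java library); the class and field names below are mine,
not evs4j's or the paper's:

    // Hypothetical sketch of the information a totem join message carries:
    // the sender's identity plus the set of processors it has detected so far.
    import java.util.Set;
    import java.util.TreeSet;

    public class JoinMessage {
        private final int senderId;
        private final Set<Integer> detectedProcessors = new TreeSet<>();

        public JoinMessage(int senderId, Set<Integer> detected) {
            this.senderId = senderId;
            this.detectedProcessors.addAll(detected);
            this.detectedProcessors.add(senderId); // "I am here" is implied
        }

        public int getSenderId() { return senderId; }

        public Set<Integer> getDetectedProcessors() { return detectedProcessors; }
    }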

Imagine that you now have a 'ring' of processors numbered 1 through n.
Processor number 1 can send some multicast packets, with ids 1, 2, etc.
When it has nothing more to send, it sends a udp packet to processor 2.
This packet is called the 'token'. In my implementation (evs4j) the token
is also multicast, to keep things simple. The token contains the id of the
last message sent. Processor 2 may have missed any number of messages,
including the token itself. Retransmission of the token is handled using a
timeout (see the article). Once processor 2 has the token, it compares the
last id sent against its own buffer of received messages and detects any
gaps. If there are any, it lists the ids of the missed messages in the
token before sending it on. It also sends its own new messages before
passing the token along. When a processor receives the token and sees that
some other processor missed messages, it resends whatever it can.
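
Here is a rough Java sketch of that token-handling step, assuming made-up
Token and Message types; none of the names come from evs4j or the paper,
and a real implementation would track an "all received up to" id instead
of scanning from 1:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.TreeSet;

    // Hypothetical sketch of what a processor does when it holds the token.
    public class TokenHandler {

        // A regular multicast message, identified by a sequence id.
        static class Message {
            final long id;
            final byte[] payload;
            Message(long id, byte[] payload) { this.id = id; this.payload = payload; }
        }

        // The circulating token: the last id sent on the ring plus a
        // retransmission-request list filled in by processors with gaps.
        static class Token {
            long lastIdSent;
            final Set<Long> retransmissionRequests = new TreeSet<>();
        }

        // Messages this processor has received so far, keyed by id.
        private final SortedMap<Long, Message> received = new TreeMap<>();

        Token onToken(Token token) {
            // Resend anything another processor reported missing that we have.
            List<Long> served = new ArrayList<>();
            for (long id : token.retransmissionRequests) {
                Message m = received.get(id);
                if (m != null) {
                    multicast(m);
                    served.add(id);
                }
            }
            token.retransmissionRequests.removeAll(served);

            // Record our own gaps in the token before passing it on.
            for (long id = 1; id <= token.lastIdSent; id++) {
                if (!received.containsKey(id)) {
                    token.retransmissionRequests.add(id);
                }
            }
            return token; // new messages would be sent here, then the token forwarded
        }

        private void multicast(Message m) {
            // Placeholder: in a real implementation this writes a UDP multicast packet.
        }
    }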

That's how totem adds reliability to multicasting.

Ordering is trivial. Since each message has a positive integer as an id,
totem does not deliver message N to the application unless it has already
delivered message N-1.
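
In code this is just holding messages back until the next expected id
shows up; a minimal sketch with illustrative names:

    import java.util.SortedMap;
    import java.util.TreeMap;

    // Minimal sketch of in-order delivery: hold back messages until
    // every lower id has been delivered.
    public class OrderedDelivery {
        private long nextToDeliver = 1;
        private final SortedMap<Long, byte[]> pending = new TreeMap<>();

        // Called for every message received (including retransmissions).
        public void onReceive(long id, byte[] payload) {
            pending.put(id, payload);
            // Deliver as long as the next expected id is available.
            while (pending.containsKey(nextToDeliver)) {
                deliverToApplication(pending.remove(nextToDeliver));
                nextToDeliver++;
            }
        }

        private void deliverToApplication(byte[] payload) {
            // Hand the message to the application layer.
        }
    }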

Flow control is actually totem's strongest point. On a LAN, when a message
is dropped it's usually because of buffer overflow at the receiver.
Therefore throttling is the critical factor that needs to be addressed in
order to reach the maximum theoretical throughput of the group as a whole.
Totem implements flow control using a window which sets the maximum number
of messages that may be sent during the current token rotation. It works
pretty well. You can read the details in the article above.
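
The core of it is a per-rotation budget carried on the token. A rough
sketch, simplified relative to the paper (which also tracks per-processor
backlogs to share the window fairly):

    // Rough sketch of window-based flow control: the token carries a count
    // of messages broadcast during the current rotation, and a processor
    // may only add new messages while that count is below the window.
    public class FlowControl {
        private final int windowSize;   // max messages per token rotation

        public FlowControl(int windowSize) {
            this.windowSize = windowSize;
        }

        // How many new messages this processor may broadcast while it
        // holds the token, given the count already carried on the token.
        public int allowance(int broadcastSoFarThisRotation) {
            return Math.max(0, windowSize - broadcastSoFarThisRotation);
        }
    }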

EVS4J has an extra feature which is not in the totem article, namely
congestion control, aka window tuning. I used to test totem on my LAN, and
telnetting to the various servers was a nightmare because the network
cards were all packed with totem messages. I knew the problem had been
solved in TCP using the Van Jacobson et al. algorithm, so I adapted that
algorithm to the totem window. Now if you run the totem benchmark and try
to transfer a big file on the same LAN, the window backs off. So I still
use a _maximum_ window size, but when there is extraneous load on the LAN
the window is free to oscillate between zero (just sending the token
around) and full throttle.
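
The flavor of that adaptation is additive increase / multiplicative
decrease, as in TCP; the constants and the loss signal below are
placeholders, not what evs4j actually uses:

    // Sketch of AIMD-style window tuning, loosely modeled on TCP congestion
    // control. The constants and the loss signal are illustrative only.
    public class WindowTuner {
        private final int maxWindow;    // hard upper bound, still configured by hand
        private double window = 1.0;    // current window, in messages per rotation

        public WindowTuner(int maxWindow) {
            this.maxWindow = maxWindow;
        }

        // Called once per token rotation.
        public int adjust(boolean lossDetectedThisRotation) {
            if (lossDetectedThisRotation) {
                window = Math.max(0.0, window / 2.0);       // back off multiplicatively
            } else {
                window = Math.min(maxWindow, window + 1.0); // probe additively
            }
            return (int) window;
        }
    }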

As far as performance goes, I believe that on a regular PC and a fast
ethernet _hub_ evs4j will do at least 6000 messages per second (with
1500-byte messages), with an average latency of a few ms for a small ring.
For a larger ring the latency can get out of hand, and indeed there is a
whole other protocol, the totem "multiple-ring" protocol, which uses
several rings and gateways between them. I didn't implement that because I
think testing it would be too challenging for me.

The thing I like about totem is that its throughput and latency are
predictable. The maximum window size is the tuning parameter. The bigger
the window size, the greater the throughput (until you reach the maximum),
but also the greater the latency. So it's ideal for systems in a data
center with high data-sharing requirements, whereas for loosely coupled
clusters you can do better with a non-token-passing protocol. The problem
with those is that flow control itself requires communication between
nodes, so those protocols probably don't get close to the maximum
theoretical throughput. This may be true even in protocols which offer no
ordering guarantees beyond "fifo".

I think this is probably enough information to start a discussion if
anyone is interested.

Guglielmo

