This is a very educational thread - maybe Jules can incorporate some of these ideas into his document on clustering.
I do have a few questions on Guglielmo's points on session replication. (I'm not an expert, so please bear with my questions.)
>1. The user should configure a minimum-degree-of-replication R. This is
>the number of replicas of a specific session which need to be available in
>order for an HTTP request to be serviced.
1) How do you figure out the most efficient value for R?
I assume that as R increases, network chatter increases by some factor X, where X depends on whether the transport is a multicast protocol or 1->1 unicast (first of all, is this assumption correct?).
And as R decreases, the chance of a request hitting a server where the session is not replicated goes up.
So is the sweet spot a balance between the above two factors, or have I missed any other critical factor(s)? (I have tried to make this concrete in the sketch below.)
2) "Minimum-degree-of-replication" implies a floor to me. Is there also a ceiling, like a maximum-degree-of-replication? I guess we don't want a session's replica count to grow beyond a certain point.
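To make the tradeoff in question 1 concrete, here is a tiny back-of-the-envelope sketch (my own illustration, not from WADI or Totem; it assumes requests are routed uniformly at random with no affinity, and that each write is pushed to R-1 replicas over unicast):

    // Illustrative only: uniform random routing, unicast replication.
    public class ReplicationTradeoff {
        public static void main(String[] args) {
            int n = 8; // hypothetical cluster size
            for (int r = 1; r <= n; r++) {
                // Chance a request lands on a node holding no replica.
                double missProbability = (double) (n - r) / n;
                // Messages per session write with 1->1 replication;
                // with true multicast this is closer to a constant cost.
                int unicastFanOut = r - 1;
                System.out.printf("R=%d miss=%.2f writes=%d%n",
                                  r, missProbability, unicastFanOut);
            }
        }
    }

The sweet spot would then be wherever the redirect (or blocking) cost of a miss balances the per-write fan-out cost for a given traffic pattern.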
>2. When an HTTP request arrives, if the cluster which received it does not
>have R copies then it blocks (it waits until there are.) This should work in
>data centers because partitions are likely to be very short-lived (aka
>virtual partitions, which are due to congestion, not to any hardware
>issue.)
1) Can you please elaborate a bit more on this? I didn't really understand it. When you say "it waits until", do you mean:
a) wait until there are R replicas in the cluster?
b) or wait until the session is replicated on the server where the HTTP request was received?
2) By "virtual partition", do you mean a subset of nodes being isolated due to congestion? By isolation I mean that they are unable to replicate their sessions to, or receive replicated sessions from, nodes outside the subset. Is this correct?
3) Assuming an HTTP request arrives and the cluster does not have R copies: how is this situation different from "an HTTP request arrives but there is no replica of the session on that server"? (A sketch of how I picture the blocking follows this list.)
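For what it's worth, here is how I picture the blocking in point 2 - a minimal sketch, assuming a hypothetical SessionRegistry that reflects the current cluster view, and a bounded wait so that a real (non-virtual) partition eventually surfaces as an error:

    // Sketch only: SessionRegistry is a hypothetical cluster view.
    interface SessionRegistry {
        int replicaCount(String sessionId);
    }

    public final class ReplicationGate {
        private final Object lock = new Object();
        private final SessionRegistry registry;

        public ReplicationGate(SessionRegistry registry) {
            this.registry = registry;
        }

        // Called whenever the membership/replica view changes.
        public void viewChanged() {
            synchronized (lock) {
                lock.notifyAll();
            }
        }

        // Blocks until the session has at least r replicas, or times out.
        public void awaitReplicas(String sessionId, int r, long timeoutMillis)
                throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            synchronized (lock) {
                while (registry.replicaCount(sessionId) < r) {
                    long remaining = deadline - System.currentTimeMillis();
                    if (remaining <= 0) {
                        // Virtual partitions heal quickly; a timeout here
                        // suggests a real partition, so fail the request.
                        throw new IllegalStateException("degree R not met");
                    }
                    lock.wait(remaining);
                }
            }
        }
    }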
>3. If at any time an HTTP request reaches a server which does not itself have
>a replica of the session, it sends a client redirect to a node which does.
How can this be achieved? Is it by having a central coordinator that maintains the mapping, or is this information replicated on all nodes across the entire cluster?
information == "which nodes have replicas of each session"
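If that location map is replicated on every node (the second alternative), I imagine the redirect in point 3 looking roughly like this - a sketch under that assumption, with a hypothetical SessionLocator:

    import java.io.IOException;
    import javax.servlet.http.HttpServletResponse;

    // Sketch only: SessionLocator is hypothetical; it answers "which
    // nodes hold a replica of this session?" from a local, replicated map.
    interface SessionLocator {
        boolean hasLocalReplica(String sessionId);
        String anyNodeWithReplica(String sessionId); // e.g. "http://node3:8080"
    }

    class RedirectingHandler {
        private final SessionLocator locator;

        RedirectingHandler(SessionLocator locator) {
            this.locator = locator;
        }

        // Returns true if the client was redirected to a node with a replica.
        boolean redirectIfRemote(String sessionId, String requestUri,
                                 HttpServletResponse response) throws IOException {
            if (locator.hasLocalReplica(sessionId)) {
                return false; // serve the request locally
            }
            response.sendRedirect(locator.anyNodeWithReplica(sessionId) + requestUri);
            return true;
        }
    }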
The point below gave me the impression that this inventory has to be maintained either centrally or cluster-wide (ideally cluster-wide, in case the coordinator dies).
>4. When a new cluster is formed (with nodes coming or going), it takes an
>inventory of all the sessions and their version numbers. Sessions which do
>not have the necessary degree of replication need to be fixed, which will
>require some state transfer, and possibly migration of some sessions for
>proper load balancing.
Again, how does the replication healing/shedding work? Assuming nodes die, or come back carrying their old state,
how does the cluster decide which sessions to add or remove in order to maintain the target value of R?
Where does the brain/logic for this sit? Ideally it is distributed, in case the coordinator dies. (I have sketched below how I imagine such a healing pass.)
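To make my question concrete, here is a minimal sketch of the healing/shedding pass I imagine point 4 describing - all the types are hypothetical placeholders, and since the pass is deterministic over the agreed membership, every node could run it independently, so no single coordinator would be required:

    import java.util.List;
    import java.util.Map;

    // Sketch only: brings every session up to R replicas and sheds any
    // surplus. "inventory" maps sessionId -> nodes holding a replica.
    class ReplicationHealer {
        interface Transfer {
            void copySession(String sessionId, String fromNode, String toNode);
        }

        static void heal(Map<String, List<String>> inventory,
                         List<String> liveNodes, int r, Transfer transfer) {
            for (Map.Entry<String, List<String>> e : inventory.entrySet()) {
                String sessionId = e.getKey();
                List<String> holders = e.getValue();
                if (holders.isEmpty()) {
                    continue; // session lost entirely; nothing to copy from
                }
                // Healing: copy state to new nodes until R replicas exist.
                int i = 0;
                while (holders.size() < r && i < liveNodes.size()) {
                    String candidate = liveNodes.get(i++);
                    if (!holders.contains(candidate)) {
                        transfer.copySession(sessionId, holders.get(0), candidate);
                        holders.add(candidate);
                    }
                }
                // Shedding: drop surplus replicas beyond R.
                while (holders.size() > r) {
                    holders.remove(holders.size() - 1);
                }
            }
        }
    }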
General comments/questions
-------------------------------------------
1) How much do the current implementations like WADI, ACluster and ASpace address the above concerns?
2) Which aspects of the above concerns can Totem address better than other protocols?
3) Could a SEDA-like architecture solve the problem of deciding the value of R dynamically at runtime, from time to time, based on load and network latency? I guess the network latency could be measured with some metrics around token passing or something like that.
Answers are greatly appreciated.
Regards,
Rajith.
On 1/16/06, lichtner <[EMAIL PROTECTED]> wrote:
On Mon, 16 Jan 2006, Jules Gosnell wrote:
> REMOVE_NODE is when a node leaves cleanly, FAILED_NODE when a node dies ...
I figured. I imagine that if I had to add this distinction to totem I
would add a message where the node in question announces that it is
leaving, and then stops forwarding the token. On the other hand, it does
not need to announce anything, and the other nodes will detect that it
left. In fact totem does not judge a node either way: you can leave
because you want to or under duress, and the consequences as far as
distributed algorithms are concerned are probably minimal. I think the
only place where this might matter is for logging purposes (but that
could be handled at the application level) or to speed up the membership
protocol, although it's already pretty fast.
So I would not draw a distinction there.
> By also treating nodes joining, leaving and dying as split and merge
> operations I can reduce the number of cases that I have to deal with.
I would even add that the difference is known only to the application.
> and ensure that what might be very uncommonly run code (run on network
> partition/healing) is the same code that is commonly run on e.g. node
> join/leave - so it is likely to be more robust.
Sounds good.
> In the case of a binary split, I envisage two sets of nodes losing
> contact with each other. Each cluster fragment will repair its internal
> structure. I expect that after this repair, neither fragment will carry
> a complete copy of the cluster's original state (unless we are
> replicating 1->all, which WADI will usually not do), rather, the two
> datasets will intersect and their union will be the original dataset.
> Replicated state will carry a version number.
I think a version number should work very well.
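As an illustration of how the version number might be used at merge time (a sketch of the trivial case only, not WADI's actual code; the Session type here is hypothetical):

    // Sketch only: keep the replica with the higher version number.
    // Real divergence handling would be deploy-time pluggable, as Jules says.
    class VersionedMerge {
        static final class Session {
            final String id;
            final long version;
            final byte[] state;
            Session(String id, long version, byte[] state) {
                this.id = id;
                this.version = version;
                this.state = state;
            }
        }

        // Returns the more recent of two replicas of the same session.
        static Session merge(Session a, Session b) {
            if (!a.id.equals(b.id)) {
                throw new IllegalArgumentException("different sessions");
            }
            return (a.version >= b.version) ? a : b;
        }
    }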
> If client affinity survives the split (i.e. clients continue to talk to
> the same nodes), then we should find ourselves in a working state, with
> two smaller clusters carrying overlapping and diverging state. Each
> piece of state should be static in one subcluster and divergent in the
> other (it has only one client). The version carried by each piece of
> state may be used to decide which is the most recent version.
>
> (If client affinity is not maintained, then, without a backchannel of
> some sort, we are in trouble).
>
> When a merge occurs, WADI will be able to merge the internal
> representations of the participants, delegating awkward decisions about
> divergent state to deploy-time pluggable algorithms. Hopefully, each
> piece of state will only have diverged in one cluster fragment, so
> choosing which copy to go forward with will be trivial.
> A node death can just be thought of as a 'split' which never 'merges'.
Definitely :)
> Of course, multiple splits could occur concurrently and merging them is
> a little more complicated than I may have implied, but I am getting
> there....
Although I consider the problem of session replication less than
glamorous, since it is at hand, I would approach it this way:
1. The user should configure a minimum-degree-of-replication R. This is
the number of replicas of a specific session which need to be available in
order for an HTTP request to be serviced.
2. When an HTTP request arrives, if the cluster which received it does not
have R copies then it blocks (it waits until there are.) This should work in
data centers because partitions are likely to be very short-lived (aka
virtual partitions, which are due to congestion, not to any hardware
issue.)
3. If at any time an HTTP request reaches a server which does not itself have
a replica of the session, it sends a client redirect to a node which does.
4. When a new cluster is formed (with nodes coming or going), it takes an
inventory of all the sessions and their version numbers. Sessions which do
not have the necessary degree of replication need to be fixed, which will
require some state transfer, and possibly migration of some sessions for
proper load balancing.
Guglielmo
