lichtner wrote:

As Jules requested I am looking at the AC api. I report my observations
below:

ClusterEvent appears to represent membership-related events. These you
can generate from evs4j, as follows: write an adapter that implements
evs4j.Listener. In the onConfiguration(..) method you get notified of
new configurations (new groups). You can generate ClusterEvent.ADD_NODE
etc. by doing a diff of the old configuration and the new one.
Evs4j does not support arbitrary algorithms for electing coordinators.
In fact, in totem there is no coordinator. If a specific election is
important for you, you can design one using totem's messages. If not,
in evs4j node names are integers, so the coordinator can be the lowest
integer. This is checked by evs4j.Configuration.getCoordinator().
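The diff approach described above can be sketched without depending on evs4j's actual types — a minimal illustration, assuming node names are plain integers and a configuration is modelled as a `Set<Integer>` (the real `evs4j.Configuration` and `evs4j.Listener` signatures are not shown here):

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of the configuration-diff idea: inside an adapter's
// onConfiguration(..) callback you would compare the previous view
// with the new one. A configuration is modelled here as Set<Integer>.
public class MembershipDiff {

    // Nodes present in the new configuration but not in the old one
    // would map to ClusterEvent.ADD_NODE events.
    public static Set<Integer> added(Set<Integer> oldConf, Set<Integer> newConf) {
        Set<Integer> result = new TreeSet<>(newConf);
        result.removeAll(oldConf);
        return result;
    }

    // Nodes present in the old configuration but missing from the new
    // one would map to REMOVE_NODE/FAILED_NODE events (totem cannot
    // tell the two apart).
    public static Set<Integer> removed(Set<Integer> oldConf, Set<Integer> newConf) {
        Set<Integer> result = new TreeSet<>(oldConf);
        result.removeAll(newConf);
        return result;
    }

    // The coordinator is simply the lowest node id, which is what
    // evs4j.Configuration.getCoordinator() reports.
    public static int coordinator(Set<Integer> conf) {
        return new TreeSet<>(conf).first();
    }
}
```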

I don't know the difference between REMOVE_NODE and FAILED_NODE. In totem
there is no difference between the two.
REMOVE_NODE is when a node leaves cleanly, FAILED_NODE when a node dies ...

I don't think REMOVE_NODE is actually tied to an Event in the external API....

The only other class I think I need to comment on is Cluster. It
resembles a jms session, even being coupled to actual jms interfaces. You
can definitely implement producers and consumers and put them on top of
evs4j. The method send(Destination, Message) would have to encode Message
on top of fixed-length evs4j messages. No problem here.
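One way to picture that encoding step — a hypothetical framing sketch, not evs4j's actual wire format: split the variable-length payload into fixed-size frames, with a 4-byte length header at the front so the receiver can trim the padding. `FRAME_SIZE` and the header layout are illustrative assumptions.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: encode a variable-length Message payload onto
// fixed-length frames. The first 4 bytes carry the payload length; the
// final frame is zero-padded. Not evs4j's real format.
public class Framing {
    static final int FRAME_SIZE = 64;

    public static List<byte[]> fragment(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length);
        buf.put(payload);
        byte[] all = buf.array();
        List<byte[]> frames = new ArrayList<>();
        for (int off = 0; off < all.length; off += FRAME_SIZE) {
            byte[] frame = new byte[FRAME_SIZE]; // zero-padded tail
            System.arraycopy(all, off, frame, 0, Math.min(FRAME_SIZE, all.length - off));
            frames.add(frame);
        }
        return frames;
    }

    public static byte[] reassemble(List<byte[]> frames) {
        ByteBuffer buf = ByteBuffer.allocate(frames.size() * FRAME_SIZE);
        for (byte[] f : frames) buf.put(f);
        buf.flip();
        int length = buf.getInt();
        byte[] payload = new byte[length];
        buf.get(payload);
        return payload;
    }
}
```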

Personally, I would not have mixed half the jms api with an original api.
I don't think it sends a good message as far as risk management goes. I
think people are prepared to deal with a product that says 'we assume jms'
or 'we are completely home-grown because we are so much better', but not a
mix of the two. Anyway that's not for me to say. Whatever works.
I'll leave this one to James.

In conclusion, yes, I think you could build an implementation of AC on top
of evs4j.

BTW, how does AC defend against the problem of a split-brain cluster?
Shared scsi disk? Majority voting? Curious.
Well, I think AC's approach is that it is an app-space problem - but James may care to comment.

As an AC user I am considering the following approach for WADI (HttpSession and SFSB clustering solution).

(1) simplifying the notifications that I might get from a cluster to the following:
- Split : 0-N nodes have left the cluster
- Merge : 0-N nodes have joined the cluster
- Change : A node has updated its public, distributed state

'Split' should now be generic enough to encompass the following common cases:

- a node leaving cleanly (having evacuated its state, therefore carrying NO state) [clean node shutdown]
- a node dying (therefore carrying state) [catastrophic failure]
- a group of nodes falling out of contact (still carrying state) [network partition]

'Merge' (a join) can encompass:

- a new node joining (therefore carrying NO state).
- a group of nodes coming back into contact after a split (carrying state that needs to be merged) [network healing]


'Change'

- same as in AC - each node publishes, by distribution, a small amount of data; this is republished each time it is updated.


By also treating nodes joining, leaving, and dying as split and merge operations, I can reduce the number of cases that I have to deal with, and ensure that code which might run very rarely (on network partition/healing) is the same code that runs routinely (e.g. on node join/leave) - so it is likely to be more robust.
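The collapse of join/leave/death into just two notifications can be sketched as follows — a hypothetical illustration (the names are not WADI's actual API), where any membership change reduces to at most one Split and one Merge, whatever the underlying cause:

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of the simplified notification model: a membership change is
// reduced to a Split set (nodes that left) and a Merge set (nodes that
// joined), whether the cause was a clean shutdown, a crash, or a
// network partition/healing.
public class ClusterDelta {
    public final Set<Integer> split;  // nodes that left the view
    public final Set<Integer> merge;  // nodes that joined the view

    public ClusterDelta(Set<Integer> oldView, Set<Integer> newView) {
        split = new TreeSet<>(oldView);
        split.removeAll(newView);
        merge = new TreeSet<>(newView);
        merge.removeAll(oldView);
    }
}
```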

In the case of a binary split, I envisage two sets of nodes losing contact with each other. Each cluster fragment will repair its internal structure. I expect that after this repair, neither fragment will carry a complete copy of the cluster's original state (unless we are replicating 1->all, which WADI will usually not do), rather, the two datasets will intersect and their union will be the original dataset. Replicated state will carry a version number.

If client affinity survives the split (i.e. clients continue to talk to the same nodes), then we should find ourselves in a working state, with two smaller clusters carrying overlapping and diverging state. Each piece of state should be static in one subcluster and divergent in the other (it has only one client). The version carried by each piece of state may be used to decide which is the most recent version.

(If client affinity is not maintained, then, without a backchannel of some sort, we are in trouble).

When a merge occurs, WADI will be able to merge the internal representations of the participants, delegating awkward decisions about divergent state to deploy-time pluggable algorithms. Hopefully, each piece of state will only have diverged in one cluster fragment, so choosing which copy to go forward with will be trivial.
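Under the assumption that each piece of replicated state carries a version number, the easy case of that merge could look like this — a minimal sketch with hypothetical names (`Versioned`, `merge`), not WADI's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of merging two cluster fragments' datasets after a network
// heals: take the union, and where a key diverged, keep the copy with
// the higher version.
public class StateMerge {

    public static class Versioned {
        final long version;
        final String value;
        public Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    // Union of both fragments; higher version wins on conflict. If both
    // fragments updated the same key to the same version (the awkward
    // case), a deploy-time pluggable algorithm would have to decide -
    // here we simply keep the first fragment's copy.
    public static Map<String, Versioned> merge(Map<String, Versioned> a,
                                               Map<String, Versioned> b) {
        Map<String, Versioned> result = new HashMap<>(a);
        for (Map.Entry<String, Versioned> e : b.entrySet()) {
            Versioned mine = result.get(e.getKey());
            if (mine == null || e.getValue().version > mine.version) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```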

A node death can just be thought of as a 'split' which never 'merges'.

Of course, multiple splits could occur concurrently and merging them is a little more complicated than I may have implied, but I am getting there....

WADI's approach should work for HttpSession and SFSB, where there is a single client who will be talking to a single node. In the case of some other type, where clients for the same resource may end up in different cluster fragments, this approach will be insufficient.

I would be very interested in hearing your thoughts on the subject,


Jules

Guglielmo


--
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
* Jules Gosnell
* Partner
* Core Developers Network (Europe)
*
*    www.coredevelopers.net
*
* Open Source Training & Support.
**********************************/
