lichtner wrote:

As Jules requested I am looking at the AC api. I report my observations
below:

ClusterEvent appears to represent membership-related events. These you
can generate from evs4j, as follows: write an adapter that implements
evs4j.Listener. In the onConfiguration(..) method you get notified of
new configurations (new groups). You can generate ClusterEvent.ADD_NODE
etc. by doing a diff of the old configuration and the new one.
Evs4j does not support arbitrary algorithms for electing coordinators.
In fact, in totem there is no coordinator. If a specific election is
important for you, you can design one using totem's messages. If not,
in evs4j node names are integers, so the coordinator can be the lowest
integer. This is checked by evs4j.Configuration.getCoordinator().
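The diff approach described above can be sketched without depending on evs4j's actual types — a minimal illustration, assuming node names are plain integers and a configuration is modelled as a `Set<Integer>` (the real `evs4j.Configuration` and `evs4j.Listener` signatures are not shown here):

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of the configuration-diff idea: inside an adapter's
// onConfiguration(..) callback you would compare the previous view
// with the new one. A configuration is modelled here as Set<Integer>.
public class MembershipDiff {

    // Nodes present in the new configuration but not in the old one
    // would map to ClusterEvent.ADD_NODE events.
    public static Set<Integer> added(Set<Integer> oldConf, Set<Integer> newConf) {
        Set<Integer> result = new TreeSet<>(newConf);
        result.removeAll(oldConf);
        return result;
    }

    // Nodes present in the old configuration but missing from the new
    // one would map to REMOVE_NODE/FAILED_NODE events (totem cannot
    // tell the two apart).
    public static Set<Integer> removed(Set<Integer> oldConf, Set<Integer> newConf) {
        Set<Integer> result = new TreeSet<>(oldConf);
        result.removeAll(newConf);
        return result;
    }

    // The coordinator is simply the lowest node id, which is what
    // evs4j.Configuration.getCoordinator() reports.
    public static int coordinator(Set<Integer> conf) {
        return new TreeSet<>(conf).first();
    }
}
```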

I don't know the difference between REMOVE_NODE and FAILED_NODE. In totem
there is no difference between the two.
REMOVE_NODE is when a node leaves cleanly, FAILED_NODE when a node dies ...

I don't think REMOVE_NODE is actually tied to an Event in the external API....

The only other class I think I need to comment on is Cluster. It
resembles a jms session, even being coupled to actual jms interfaces. You
can definitely implement producers and consumers and put them on top of
evs4j. The method send(Destination, Message) would have to encode Message
on top of fixed-length evs4j messages. No problem here.
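One way to picture that encoding step — a hypothetical framing sketch, not evs4j's actual wire format: split the variable-length payload into fixed-size frames, with a 4-byte length header at the front so the receiver can trim the padding. `FRAME_SIZE` and the header layout are illustrative assumptions.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: encode a variable-length Message payload onto
// fixed-length frames. The first 4 bytes carry the payload length; the
// final frame is zero-padded. Not evs4j's real format.
public class Framing {
    static final int FRAME_SIZE = 64;

    public static List<byte[]> fragment(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length);
        buf.put(payload);
        byte[] all = buf.array();
        List<byte[]> frames = new ArrayList<>();
        for (int off = 0; off < all.length; off += FRAME_SIZE) {
            byte[] frame = new byte[FRAME_SIZE]; // zero-padded tail
            System.arraycopy(all, off, frame, 0, Math.min(FRAME_SIZE, all.length - off));
            frames.add(frame);
        }
        return frames;
    }

    public static byte[] reassemble(List<byte[]> frames) {
        ByteBuffer buf = ByteBuffer.allocate(frames.size() * FRAME_SIZE);
        for (byte[] f : frames) buf.put(f);
        buf.flip();
        int length = buf.getInt();
        byte[] payload = new byte[length];
        buf.get(payload);
        return payload;
    }
}
```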

Personally, I would not have mixed half the jms api with an original api.
I don't think it sends a good message as far as risk management goes. I
think people are prepared to deal with a product that says 'we assume jms'
or 'we are completely home-grown because we are so much better', but not a
mix of the two. Anyway that's not for me to say. Whatever works.
I'll leave this one to James.

In conclusion, yes, I think you could build an implementation of AC on top
of evs4j.

BTW, how does AC defend against the problem of a split-brain cluster?
Shared scsi disk? Majority voting? Curious.
Well, I think AC's approach is that it is an app-space problem - but James may care to comment.

As an AC user I am considering the following approach for WADI (HttpSession and SFSB clustering solution).

(1) simplifying the notifications that I might get from a cluster to the following:
- Split : 0-N nodes have left the cluster
- Merge : 0-N nodes have joined the cluster
- Change : A node has updated its public, distributed state

'Split' should now be generic enough to encompass the following common cases:

- a node leaving cleanly (having evacuated its state, therefore carrying NO state) [clean node shutdown]
- a node dying (therefore carrying state) [catastrophic failure]
- a group of nodes falling out of contact (still carrying state) [network partition]

'Merge' (a join) can encompass:

- a new node joining (therefore carrying NO state).
- a group of nodes coming back into contact after a split (carrying state that needs to be merged) [network healing]


'Change'

- same as in AC - each node publishes, by distribution, a small amount of data; this is republished each time it is updated.


By also treating nodes joining, leaving, and dying as split and merge operations, I can reduce the number of cases that I have to deal with, and ensure that code which might run very rarely (on network partition/healing) is the same code that runs routinely (e.g. on node join/leave) - so it is likely to be more robust.
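The collapse of join/leave/death into just two notifications can be sketched as follows — a hypothetical illustration (the names are not WADI's actual API), where any membership change reduces to at most one Split and one Merge, whatever the underlying cause:

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of the simplified notification model: a membership change is
// reduced to a Split set (nodes that left) and a Merge set (nodes that
// joined), whether the cause was a clean shutdown, a crash, or a
// network partition/healing.
public class ClusterDelta {
    public final Set<Integer> split;  // nodes that left the view
    public final Set<Integer> merge;  // nodes that joined the view

    public ClusterDelta(Set<Integer> oldView, Set<Integer> newView) {
        split = new TreeSet<>(oldView);
        split.removeAll(newView);
        merge = new TreeSet<>(newView);
        merge.removeAll(oldView);
    }
}
```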

In the case of a binary split, I envisage two sets of nodes losing contact with each other. Each cluster fragment will repair its internal structure. I expect that after this repair, neither fragment will carry a complete copy of the cluster's original state (unless we are replicating 1->all, which WADI will usually not do), rather, the two datasets will intersect and their union will be the original dataset. Replicated state will carry a version number.

If client affinity survives the split (i.e. clients continue to talk to the same nodes), then we should find ourselves in a working state, with two smaller clusters carrying overlapping and diverging state. Each piece of state should be static in one subcluster and divergent in the other (it has only one client). The version carried by each piece of state may be used to decide which is the most recent version.

(If client affinity is not maintained, then, without a backchannel of some sort, we are in trouble).

When a merge occurs, WADI will be able to merge the internal representations of the participants, delegating awkward decisions about divergent state to deploy-time pluggable algorithms. Hopefully, each piece of state will only have diverged in one cluster fragment, so choosing which copy to go forward with will be trivial.
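Under the assumption that each piece of replicated state carries a version number, the easy case of that merge could look like this — a minimal sketch with hypothetical names (`Versioned`, `merge`), not WADI's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of merging two cluster fragments' datasets after a network
// heals: take the union, and where a key diverged, keep the copy with
// the higher version.
public class StateMerge {

    public static class Versioned {
        final long version;
        final String value;
        public Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    // Union of both fragments; higher version wins on conflict. If both
    // fragments updated the same key to the same version (the awkward
    // case), a deploy-time pluggable algorithm would have to decide -
    // here we simply keep the first fragment's copy.
    public static Map<String, Versioned> merge(Map<String, Versioned> a,
                                               Map<String, Versioned> b) {
        Map<String, Versioned> result = new HashMap<>(a);
        for (Map.Entry<String, Versioned> e : b.entrySet()) {
            Versioned mine = result.get(e.getKey());
            if (mine == null || e.getValue().version > mine.version) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```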

A node death can just be thought of as a 'split' which never 'merges'.

Of course, multiple splits could occur concurrently and merging them is a little more complicated than I may have implied, but I am getting there....

WADI's approach should work for HttpSession and SFSB, where there is a single client who will be talking to a single node. In the case of some other type, where clients for the same resource may end up in different cluster fragments, this approach will be insufficient.

I would be very interested in hearing your thoughts on the subject,


Jules

Guglielmo


--
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
* Jules Gosnell
* Partner
* Core Developers Network (Europe)
*
*    www.coredevelopers.net
*
* Open Source Training & Support.
**********************************/
