Re: Clustering (long)

Jules Gosnell Tue, 02 Aug 2005 07:22:10 -0700

Andy Piper wrote:

Hi Jules
At 05:37 AM 7/27/2005, Jules Gosnell wrote:
I agree on the SPoF thing - but I think you misunderstand my Coordinator arch. I do not have a single static Coordinator node, but a dynamic Coordinator role, into which a node may be elected. Thus every node is a potential Coordinator. If the elected Coordinator dies, another is immediately elected. The election strategy is pluggable, although it will probably end up being hardwired to "oldest-cluster-member". The reason behind this is that re-laying out your cluster is much simpler if it is done in a single vm. I originally tried to do it in multiple vms, each taking responsibility for pieces of the cluster, but if the vms' views are not completely in sync, things get very hairy, and completely in sync is an expensive thing to achieve - and would introduce a cluster-wide single point of contention. So I do it in a single vm, as fast as I can, with failover, in case that vm evaporates. Does that sound better than the scenario that you had in mind?
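
To make the "oldest-cluster-member" strategy concrete, here is a minimal Java sketch. It assumes that every node is looking at the same membership view; the Member and OldestMemberElection classes are hypothetical illustrations and are not taken from WADI.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.Optional;

// Hypothetical member descriptor (an assumption for this sketch, not a WADI class).
final class Member {
    final String name;
    final long joinedAtMillis; // when this node joined the cluster

    Member(String name, long joinedAtMillis) {
        this.name = name;
        this.joinedAtMillis = joinedAtMillis;
    }
}

final class OldestMemberElection {
    // Elects the member with the earliest join time. Ties are broken by name,
    // so any two nodes holding the same view pick the same Coordinator.
    static Optional<Member> elect(Collection<Member> view) {
        return view.stream()
                .min(Comparator.comparingLong((Member m) -> m.joinedAtMillis)
                        .thenComparing(m -> m.name));
    }
}
```

The choice is deterministic only for a given view: two nodes whose views differ can still elect different Coordinators.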
This is exactly the "hard" computer science problem that you shouldn't be trying to solve if at all possible. It's hard because network partitions or hung processes (think GC) make it very easy for your colleagues to think you are dead when you do not share that view. The result is two processes who think they are the coordinator and anarchy can ensue (commonly called split-brain syndrome). I can point you at papers if you want, but I really suggest that you aim for an implementation that is independent of a central coordinator. Note that a central coordinator is necessary if you want to implement a strongly-consistent in-memory database, but this is not usually a requirement for session replication say.
http://research.microsoft.com/Lampson/58-Consensus/Abstract.html gives a good introduction to some of these things. I also presented at JavaOne on related issues, you should be able to download the presentation from dev2dev.bea.com at some point (not there yet - I just checked).

OK - I will have a look at these papers and reconsider... perhaps I can come up with some sort of fractal algorithm which recursively breaks down the cluster into subclusters each of which is capable of doing likewise to itself and then lay out the buckets recursively via this metaphor... - this would be much more robust, as you point out, but, I think, a more complicated architecture. I will give it some serious thought. Have you any suggestions/papers as to how you might do something like this in a distributed manner, bearing in mind that as a node joins, some existing nodes will see it as having joined and some will not yet have noticed and vice-versa on leaving...?

The Coordinator is not there to support session replication, but rather the management of the distributed map (map of which a few buckets live on each node) which is used by WADI to discover very efficiently whether a session exists and where it is located. This map must be rearranged, in the most efficient way possible, each time a node joins or leaves the cluster.
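
As a rough illustration of such a bucket map, the Java sketch below hashes a session id to a bucket and re-lays out buckets when membership changes, moving only buckets whose owner has left or holds more than its share. The BucketMap class, the fixed bucket count, the use of String.hashCode() and the quota scheme are all assumptions made for this example; they do not describe WADI's actual algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bucket map: owners[i] is the node that currently owns bucket i.
final class BucketMap {
    private final String[] owners;

    BucketMap(int numBuckets, String initialNode) {
        owners = new String[numBuckets];
        Arrays.fill(owners, initialNode);
    }

    int bucketFor(String sessionId) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(sessionId.hashCode(), owners.length);
    }

    String ownerOf(String sessionId) {
        return owners[bucketFor(sessionId)];
    }

    // Re-lays out the buckets over the live nodes (names assumed distinct).
    // Returns how many buckets changed owner.
    int relayout(List<String> liveNodes) {
        if (liveNodes.isEmpty()) throw new IllegalArgumentException("no live nodes");
        int base = owners.length / liveNodes.size();
        int extra = owners.length % liveNodes.size();
        Map<String, Integer> quota = new HashMap<>();
        for (int i = 0; i < liveNodes.size(); i++) {
            quota.put(liveNodes.get(i), base + (i < extra ? 1 : 0));
        }
        // A bucket stays put while its owner is alive and still under quota.
        List<Integer> orphans = new ArrayList<>();
        for (int b = 0; b < owners.length; b++) {
            Integer left = quota.get(owners[b]);
            if (left != null && left > 0) quota.put(owners[b], left - 1);
            else orphans.add(b);
        }
        // Hand the orphaned buckets to nodes that still have quota left.
        int next = 0;
        for (String node : liveNodes) {
            for (int left = quota.get(node); left > 0; left--) {
                owners[orphans.get(next++)] = node;
            }
        }
        return orphans.size();
    }
}
```

In this sketch a single caller (for instance the elected Coordinator) would invoke relayout with the new membership whenever a node joins or leaves.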
Understood. Once you have a fault-tolerant singleton coordinator you can solve lots of interesting problems, it's just hard and often not worth the effort or the expense (typical implementations involve HA HW or an HA DB or at least 3 server processes).

Since I am only currently using the singleton coordinator for bucket arrangement, I may just live with it for the moment, in order to move forward, but make a note to replace it and start background threads on how that might be achieved...

Replication is NYI - but I'm running a few mental background threads that suggest that an extension to the index will mean that it associates the session's id not just to its current location, but also to the location of a number of replicants. I also have ideas on how a session might choose nodes into which it will place its replicants and how I can avoid the primary session copy ever being colocated with a replicant (potential SPoF - if you only have one replicant), etc...
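
One simple way to keep replicants off the primary's node is sketched below in Java. It treats the node list as a ring and walks it from the position after the primary. The ReplicaPlacement class and the ring-walk policy are assumptions made for illustration only; the message does not say how WADI would choose replicant nodes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical placement helper, not a WADI class.
final class ReplicaPlacement {
    // Picks up to 'copies' distinct nodes, none of which is the primary.
    // Returns fewer than 'copies' when the cluster is too small.
    static List<String> chooseReplicaNodes(List<String> nodes, String primary, int copies) {
        List<String> chosen = new ArrayList<>();
        int start = nodes.indexOf(primary); // -1 if unknown: the walk then starts at index 0
        for (int i = 1; i <= nodes.size() && chosen.size() < copies; i++) {
            String candidate = nodes.get(Math.floorMod(start + i, nodes.size()));
            if (!candidate.equals(primary) && !chosen.contains(candidate)) {
                chosen.add(candidate);
            }
        }
        return chosen;
    }
}
```

With a single-node cluster the result is empty, which makes the lack of any safe replicant location explicit rather than silently colocating the copy with the primary.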
Right, definitely something you want to avoid.
Yes, I can see that happening - I have an improvement (NYI) to WADI's evacuation strategy (how sessions are evacuated when a node wishes to leave). Each session will be evacuated to the node which owns the bucket into which its id hashes. This is because colocation of the session with the bucket allows many messages concerned with its future destruction and relocation to be optimised away. Future requests falling elsewhere but needing this session should, in the most efficient case, be relocated to this same node, otherwise the session may be relocated, but at a cost...
How do you relocate the request? Many HW load-balancers do not support this (or else it requires using proprietary APIs), so you probably have to count on moving sessions in the normal failover case.

If I can squeeze the behaviour that I require out of the load-balancer, then, depending on the request type, I may be able to get away with a redirection with a changed session cookie or url param, or, failing this, an http-proxy, across from a filter above the servlet on one side to the http-port on the node that owns the session...

The LB-integration object is pluggable and the aim is to supply WADI with a good selection of LB integrations - currently I only have a ModJK[2] plugin working. This is able to 'restick' clients to their session's new location (although messing with the session id is a little dodgy...).

I would be very grateful for any thoughts or feedback that you could give me. I hope to get much more information about WADI into the wiki over the next few weeks. That should help generate more discussion, although I would be more than happy for people to ask me questions here on Geronimo-dev because this will give me an idea of what documentation I should write and how existing documentation may be lacking or misleading.
I guess my general comment would be that you might find it better to think specifically about the end-user problem you are trying to solve (say session replication) and work towards a solution based on that. Most short-cuts / optimizations that vendors make are specific to the problem domain and do not generally apply to all clustering problems.

The end problem is really clustered web and ejb sessions at the moment, although it looks as if by the time we have solved these issues we may well have written a fault-tolerant distributed/partitioned index that might be very useful as a generic distributed cache building block.

One thing that I do want WADI to do is to still work when replication is switched off, i.e., if a session only exists as a primary copy, even if affinity breaks down, WADI will continue to correctly render requests for that session unless some form of catastrophic failure causes the session to evaporate. This means that I need to ensure the session's timely evacuation from a node that chooses to leave the cluster to a remaining node, so that it may remain active beyond the lifetime of its original node. All of this must work flawlessly under stress, so that an admin may add or remove nodes to a running cluster without having to worry about the user state that it is managing. Nodes are added by simply starting them, and nodes removed via e.g. ctl-c-ing them.

If it is decided that a few more nines are needed in terms of session availability and the cluster owner understands the extra cost involved in in-vm replication in terms of extra hardware and bandwidth that they will have to purchase and is happy to go with in-vm replication, then it should be sufficient to up the number of replicated copies kept by the cluster from '0' to e.g. '2' and restart (it might even be possible to vary this setting on a node to node basis so that this change does not even involve a complete cluster cold start). WADI should deal with the rest.

So, I believe that I have a pretty clear idea of what WADI will do, and aside from the replication stuff (phase2) it currently does most of what I had in mind for phase1, except that it is not yet happy under stress. I figure it will probably take one or two more redesign/reimplementation iterations to get it to this stage, then I can consider replication.

I have spoken to members of the OpenEJB team about WADI's ability to relocate requests as well as sessions and we came to the conclusion that it was just as applicable in the EJB world as the web world. If the node an ejb client is talking to leaves the cluster in between calls, the client may try to contact it and then fail over to another node that it hopes holds the session. If, due to other nodes leaving/joining, it is not always clear which node will contain the session, the ability to reply to an RMI and just say "not here - there!" - i.e. an rmi redirection - would not be hard to add and would resolve this situation. Transactions are another item which I have marked phase2.
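
The "not here - there!" reply can be pictured with the plain-Java sketch below, which deliberately avoids real RMI. A node that does not hold the session answers with the name of the node that the index says does, and the client follows a bounded number of such redirections. Reply, SessionNode and RedirectingClient are hypothetical names, and the shared Map standing in for the distributed index is an assumption of this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reply: either a result, or a pointer to another node.
final class Reply {
    final String result;     // set when the call was handled locally
    final String redirectTo; // set when the answer is "not here - there!"

    private Reply(String result, String redirectTo) {
        this.result = result;
        this.redirectTo = redirectTo;
    }
    static Reply handled(String result) { return new Reply(result, null); }
    static Reply redirect(String node) { return new Reply(null, node); }
}

final class SessionNode {
    final String name;
    final Map<String, String> localSessions = new HashMap<>(); // session id -> state
    private final Map<String, String> index; // stand-in index: session id -> node name

    SessionNode(String name, Map<String, String> index) {
        this.name = name;
        this.index = index;
    }

    Reply invoke(String sessionId) {
        String state = localSessions.get(sessionId);
        if (state != null) return Reply.handled(name + ":" + state);
        String owner = index.get(sessionId);
        if (owner == null) throw new IllegalStateException("unknown session " + sessionId);
        return Reply.redirect(owner);
    }
}

final class RedirectingClient {
    // Follows redirections, giving up after maxHops so that a stale index
    // cannot bounce the client around forever.
    static String call(Map<String, SessionNode> cluster, String firstNode,
                       String sessionId, int maxHops) {
        String target = firstNode;
        for (int hop = 0; hop <= maxHops; hop++) {
            Reply reply = cluster.get(target).invoke(sessionId);
            if (reply.redirectTo == null) return reply.result;
            target = reply.redirectTo;
        }
        throw new IllegalStateException("too many redirects for " + sessionId);
    }
}
```

The hop limit is a design choice of the sketch: while nodes are joining and leaving, the index may briefly point at a node that no longer holds the session.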

So, I am trying hard to stay very focussed on the problem domain, otherwise this will never get finished :-)

Right, off to read those papers now - thanks for your posting and your interest,


Jules


Hope this helps

andy




--
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."

/**********************************
* Jules Gosnell
* Partner
* Core Developers Network (Europe)
*
*    www.coredevelopers.net
*
* Open Source Training & Support.
**********************************/
