I've had a look at the Lampson paper, but didn't take it all in on the
first pass - I think it will need some serious concentration. The Paxos
algorithm looks interesting, I will definitely pursue this avenue.
I've also given a little thought to exactly why I need a Coordinator and
how Paxos might be used to replace it. My use of a Coordinator and plans
for its future do not actually seem that far from Paxos, on a
preliminary reading.
Given that WADI currently uses a distributed map of
sessionId:sessionLocation, that this distribution is achieved by sharing
out responsibility for the fixed number of buckets that comprise the map
roughly evenly between the cluster members, and that this is currently my
most satisfying design, I can break my problem space (for bucket
arrangement) down into 3 basic cases:
1) Node joins
2) Node leaves in controlled fashion
3) Node dies
If the node under discussion is the only cluster member, then no bucket
rearrangement is necessary - this node will simply create or destroy the
full set of buckets. I'll set these subcases aside as trivial.
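To make the map concrete, here is a minimal sketch of the bucket-indexed
location table described above. All names (BucketTable etc.) are invented
for illustration - this is not WADI's API:

  import java.util.HashMap;
  import java.util.Map;

  public class BucketTable {
      private final int numBuckets;              // fixed at configuration time
      private final Map<Integer, String> owners; // bucket index -> owning node

      public BucketTable(int numBuckets) {
          this.numBuckets = numBuckets;
          this.owners = new HashMap<Integer, String>();
      }

      // the bucket an id hashes into never changes; only a bucket's owner moves
      public int bucketFor(String sessionId) {
          return Math.abs(sessionId.hashCode() % numBuckets);
      }

      public String ownerOf(String sessionId) {
          return owners.get(bucketFor(sessionId));
      }

      public void assign(int bucket, String node) {
          owners.put(bucket, node);
      }
  }

The point of the shape is that "rearrangement" only ever means reassigning
entries in 'owners'; session ids never move between buckets.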
1) The joining node will need to assume responsibility for a number of
buckets. If buckets-per-node is to be kept roughly the same for every
node, it is likely that the joining node will require transfer of a
small number of buckets from every current cluster member i.e. we are
starting a bucket rearrangement that will involve every cluster member
and only need be done if the join is successful. So, although we wish to
avoid an SPoF, if that SPoF turns out to be the joining node, then I
don't see it as a problem. If the joining node dies, then we no longer
have to worry about rearranging our buckets (unless we have lost some
that had already been transferred - see (3)). Thus the joining node may
be used as a single Coordinator/Leader for this negotiation without fear
of the SPoF problem. Are we on the same page here?
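For illustration, the arithmetic of case (1) might look like the following
sketch, in which the joiner works out how many buckets to request from each
current member. Everything here is hypothetical - none of it is WADI code:

  import java.util.LinkedHashMap;
  import java.util.Map;

  public class JoinPlan {
      // owned: current member -> number of buckets it owns (sums to numBuckets)
      public static Map<String, Integer> transfersToJoiner(Map<String, Integer> owned,
                                                           int numBuckets) {
          int fairShare = numBuckets / (owned.size() + 1); // the joiner's target
          int needed = fairShare;
          Map<String, Integer> plan = new LinkedHashMap<String, Integer>();
          for (Map.Entry<String, Integer> e : owned.entrySet()) {
              if (needed == 0) break;
              int surplus = Math.max(e.getValue() - fairShare, 0);
              int take = Math.min(surplus, needed);
              if (take > 0) {
                  plan.put(e.getKey(), take);
                  needed -= take;
              }
          }
          // e.g. {a=342, b=341, c=341} over 1024 buckets -> {a=86, b=85, c=85}
          return plan;
      }
  }

Since only the joiner runs this, its death simply aborts the plan - which is
exactly the property argued for above.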
2) The same argument may be applied in reverse to a node leaving in a
controlled fashion. It will wish to evacuate its buckets roughly equally
to all remaining cluster members. If it shuts down cleanly, this would
form part of its shutdown protocol. If it dies before or during the
execution of this protocol then we are back at (3); if not, the
SPoF issue may again be put to one side.
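Case (2) is then the mirror image - a sketch of the shutdown-protocol side,
again with invented names:

  import java.util.List;

  public class LeavePlan {
      // deal my buckets out, round-robin, to the members that will remain
      public static void evacuate(List<Integer> myBuckets, List<String> remaining) {
          for (int i = 0; i < myBuckets.size(); i++) {
              String target = remaining.get(i % remaining.size());
              // a real implementation would transfer the bucket's
              // sessionId:sessionLocation entries to 'target' here
              System.out.println("bucket " + myBuckets.get(i) + " -> " + target);
          }
      }
  }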
3) This is where things get tricky :-) Currently WADI has, for the sake
of simplicity, one single algorithm / thread / point-of-failure which
recalculates a complete bucket arrangement if it detects (1), (2) or
(3). It would be simple enough to offload the work done for (1) and (2)
to the node joining/leaving and this should reduce wadi's current
vulnerability, but we still need to deal with catastrophic failure.
Currently WADI rebuilds the missing buckets by querying the cluster for
the locations of any sessions that fall within them, but it could
equally carry a replicated backup and dust it off as part of this
procedure (the query-based rebuild is sketched below). It's just a
trade-off between work done up front and work done in exceptional
circumstances... This is the place where the Paxos algorithm may come in
handy - bucket recomposition and rearrangement. I need to give this
further thought. For the immediate future, however, I think WADI will
stay with a single Coordinator in this situation, which fails over if
http://activecluster.codehaus.org says it should - I'm delegating the
really thorny problem to James :-). I agree with you that this is an
SPoF and that WADI's ability to recover from failure here depends
directly on how we decide whether a node is alive or dead - a very
tricky thing to do.
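By way of illustration, the query-based rebuild might be sketched like this.
The Peer interface is a stand-in, not a WADI or ActiveCluster API:

  import java.util.Collection;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.Set;

  interface Peer {
      String name();
      // ids of locally held sessions that hash into any of the given buckets
      Collection<String> sessionsInBuckets(Set<Integer> buckets);
  }

  public class BucketRebuilder {
      // ask every survivor about the dead node's buckets, repopulate the index
      public static Map<String, String> rebuild(Set<Integer> lostBuckets,
                                                Collection<Peer> survivors) {
          Map<String, String> index = new HashMap<String, String>();
          for (Peer p : survivors) {
              for (String sessionId : p.sessionsInBuckets(lostBuckets)) {
                  index.put(sessionId, p.name()); // re-learn id -> location
              }
          }
          return index;
      }
  }

The replicated-backup alternative would replace the survivor queries with a
read of the backup copy - the trade-off is visible in how much of this work
happens at failure time.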
In conclusion then, I think that we have usefully identified a weakness
that will become more relevant as the rest of WADI's features mature.
The Lampson paper mentioned above describes an algorithm that allows
nodes to reach consensus on actions to be performed, redundantly and
with no SPoF. I shall consider how this might replace WADI's current
single Coordinator, whilst also looking at moving other coordination
onto joining/leaving nodes, where the coordinator's failure, coinciding
with that of its host node, will be irrelevant, since the very condition
it was intended to resolve has ceased to exist.
How does that sound, Andy? Do you agree with my thoughts on (1) & (2)?
This is great input - thanks,
Jules
Jules Gosnell wrote:
Andy Piper wrote:
Hi Jules
At 05:37 AM 7/27/2005, Jules Gosnell wrote:
I agree on the SPoF thing - but I think you misunderstand my
Coordinator arch. I do not have a single static Coordinator node,
but a dynamic Coordinator role, into which a node may be elected.
Thus every node is a potential Coordinator. If the elected
Coordinator dies, another is immediately elected. The election
strategy is pluggable, although it will probably end up being
hardwired to "oldest-cluster-member". The reason behind this is that
relaying out your cluster is much simpler if it is done in a single
vm. I originally tried to do it in multiple vms, each taking
responsibility for pieces of the cluster, but if the vms views are
not completely in sync, things get very hairy, and completely in
sync is an expensive thing to achieve - and would introduce a
cluster-wide single point of contention. So I do it in a single vm,
as fast as I can, with fail over, in case that vm evaporates. Does
that sound better than the scenario that you had in mind ?
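For concreteness, "oldest-cluster-member" is attractive because every node
can compute the same winner locally - a minimal sketch, with NodeInfo
standing in for whatever the real membership API exposes:

  import java.util.Collection;

  interface NodeInfo {
      String id();
      long joinedAt(); // e.g. millis since epoch at cluster join
  }

  public class OldestMemberElection {
      public static NodeInfo elect(Collection<NodeInfo> members) {
          NodeInfo oldest = null;
          for (NodeInfo n : members) {
              // ties on join time break on id, so every node agrees
              if (oldest == null
                      || n.joinedAt() < oldest.joinedAt()
                      || (n.joinedAt() == oldest.joinedAt()
                          && n.id().compareTo(oldest.id()) < 0)) {
                  oldest = n;
              }
          }
          return oldest;
      }
  }

Determinism only helps, of course, if every node sees the same membership -
which is precisely the assumption questioned below.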
This is exactly the "hard" computer science problem that you
shouldn't be trying to solve if at all possible. Its hard because
network partitions or hung processes (think GC) make it very easy for
your colleagues to think you are dead when you do not share that
view. The result is two processes who think they are the coordinator
and anarchy can ensue (commonly called split-brain syndrome). I can
point you at papers if you want, but I really suggest that you aim
for an implementation that is independent of a central coordinator.
Note that a central coordinator is necessary if you want to implement
a strongly-consistent in-memory database, but this is not usually a
requirement for session replication say.
http://research.microsoft.com/Lampson/58-Consensus/Abstract.html
gives a good introduction to some of these things. I also presented
at JavaOne on related issues, you should be able to download the
presentation from dev2dev.bea.com at some point (not there yet - I
just checked).
OK - I will have a look at these papers and reconsider... perhaps I
can come up with some sort of fractal algorithm which recursively
breaks the cluster down into subclusters, each of which is capable of
doing likewise to itself, and then lay out the buckets recursively via
this metaphor... - this would be much more robust, as you point out,
but, I think, a more complicated architecture. I will give it some
serious thought. Have you any suggestions/papers as to how you might
do something like this in a distributed manner, bearing in mind that,
as a node joins, some existing nodes will see it as having joined and
some will not yet have noticed, and vice-versa on leaving...?
The Coordinator is not there to support session replication, but
rather the management of the distributed map (a few of whose buckets
live on each node) which is used by WADI to discover very
efficiently whether a session exists and where it is located. This
map must be rearranged, in the most efficient way possible, each
time a node joins or leaves the cluster.
Understood. Once you have a fault-tolerant singleton coordinator you
can solve lots of interesting problems; it's just hard and often not
worth the effort or the expense (typical implementations involve HA
HW or an HA DB or at least 3 server processes).
Since I am only currently using the singleton coordinator for bucket
arrangement, I may just live with it for the moment, in order to move
forward, but make a note to replace it and start background threads on
how that might be achieved...
Replication is NYI (not yet implemented) - but I'm running a few
mental background threads that suggest that an extension to the index
will mean that it associates the session's id not just with its
current location, but also with the locations of a number of
replicants. I also have ideas on how a session might choose the nodes
into which it will place its replicants and how I can avoid the
primary session copy ever being colocated with a replicant (a
potential SPoF - if you only have one replicant), etc...
Right definitely something you want to avoid.
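As a sketch of that extended index entry - purely illustrative, no such
structure exists in WADI yet:

  import java.util.Arrays;
  import java.util.List;

  public class LocationEntry {
      final String primary;          // node holding the live session copy
      final List<String> replicants; // backup locations; must exclude primary

      LocationEntry(String primary, String... replicants) {
          if (Arrays.asList(replicants).contains(primary)) {
              // guards the rule discussed above: primary and replicant must
              // never be colocated, or one node death can lose both copies
              throw new IllegalArgumentException("replicant colocated with primary");
          }
          this.primary = primary;
          this.replicants = Arrays.asList(replicants);
      }
  }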
Yes, I can see that happening - I have an improvement (NYI) to
WADI's evacuation strategy (how sessions are evacuated when a node
wishes to leave). Each session will be evacuated to the node which
owns the bucket into which its id hashes. This is because colocation
of the session with the bucket allows many messages concerned with
its future destruction and relocation to be optimised away. Future
requests falling elsewhere but needing this session should, in the
most efficient case, be relocated to this same node; otherwise the
session may be relocated, but at a cost...
How do you relocate the request? Many HW load-balancers do not
support this (or else it requires using proprietary APIs), so you
probably have to count on
moving sessions in the normal failover case.
If I can squeeze the behaviour that I require out of the
load-balancer then, depending on the request type, I may be able to
get away with a redirection with a changed session cookie or url
param, or, failing this, an http-proxy hop from a filter above the
servlet on one side to the http port of the node that owns the session...
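A minimal sketch of that filter-level redirection, using only the servlet
API - the owner lookup and URL construction are invented, and the proxy
fallback is elided:

  import java.io.IOException;
  import javax.servlet.*;
  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletResponse;

  public class RelocationFilter implements Filter {
      public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
              throws IOException, ServletException {
          HttpServletRequest hreq = (HttpServletRequest) req;
          String owner = lookupOwner(hreq.getRequestedSessionId()); // via bucket index
          if (owner == null || isLocal(owner)) {
              chain.doFilter(req, res); // session is here (or unknown): carry on
          } else {
              // cheap path: bounce the client at the owning node; the expensive
              // fallback would proxy the request across to owner's http port
              ((HttpServletResponse) res).sendRedirect(
                  "http://" + owner + hreq.getRequestURI());
          }
      }
      public void init(FilterConfig config) {}
      public void destroy() {}

      private String lookupOwner(String sessionId) { return null; } // placeholder
      private boolean isLocal(String owner) { return true; }        // placeholder
  }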
The LB-integration object is pluggable and the aim is to supply wadi
with a good selection of LB integrations - currently I only have a
ModJK[2] plugin working. This is able to 'restick' clients to their
session's new location (although messing with the session id is a
little dodgy...).
I would be very grateful for any thoughts or feedback that you could
give me. I hope to get much more information about WADI into the
wiki over the next few weeks. That should help generate more
discussion, although I would be more than happy for people to ask me
questions here on Geronimo-dev because this will give me an idea of
what documentation I should write and how existing documentation may
be lacking or misleading.
I guess my general comment would be that you might find it better to
think specifically about the end-user problem you are trying to solve
(say session replication) and work towards a solution based on that.
Most short-cuts / optimizations that vendors make are specific to the
problem domain and do not generally apply to all clustering problems.
The end problem is really clustered web and ejb sessions at the
moment, although it looks as if by the time we have solved these
issues we may well have written a fault-tolerant
distributed/partitioned index that might be very useful as a generic
distributed cache building block.
One thing that I do want wadi to do is to keep working when replication
is switched off - i.e., if a session only exists as a primary copy,
even if affinity breaks down, wadi will continue to correctly render
requests for that session unless some form of catastrophic failure
requests for that session unless some form of catastrophic failure
causes the session to evaporate. This means that I need to ensure the
session's timely evacuation from a node that chooses to leave the
cluster to a remaining node, so that it may remain active beyond the
lifetime of its original node. All of this must work flawlessly under
stress, so that an admin may add or remove nodes to a running cluster
without having to worry about the user state that it is managing.
Nodes are added by simply starting them, and nodes removed via e.g.
ctrl-C-ing them.
If it is decided that a few more nines are needed in terms of session
availability, and the cluster owner understands the extra cost of
in-vm replication - the extra hardware and bandwidth that they will
have to purchase - and is happy to go with it, then it should be
sufficient to up the number of replicated copies kept by the cluster
from '0' to e.g. '2' and restart (it might even be possible to vary
this setting on a node-to-node basis, so that this change does not
even involve a complete cluster cold start). WADI should deal with
the rest.
So, I believe that I have a pretty clear idea of what WADI will do,
and aside from the replication stuff (phase2) it currently does most
of what I had in mind for phase1, except that it is not yet happy
under stress. I figure it will probably take one or two more
redesign/reimplementation iterations to get it to this stage, then I
can consider replication.
I have spoken to members of the OpenEJB team about wadi's ability to
relocate requests as well as sessions and we came to the conclusion
that it was just as applicable in the EJB world as the web world. If
the node an ejb client is talking to leaves the cluster in between
calls, the client may try to contact it and then failover to another
node that it hopes holds the session. If, due to other nodes
leaving/joining, it is not always clear which node will contain the
session, then the ability to reply to an RMI and just say "not here -
there!" - i.e. an rmi redirection - would not be hard to add and would
resolve this situation. Transactions are another item which I have
marked phase2.
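To make the rmi-redirection idea concrete - a sketch only; none of this is
OpenEJB or WADI API:

  public class SessionMovedException extends Exception {
      public final String currentNode; // where the session actually lives now

      public SessionMovedException(String currentNode) {
          super("not here - there: " + currentNode);
          this.currentNode = currentNode;
      }
  }

  // on the client, one extra catch turns relocation into a transparent retry:
  //
  //   try {
  //       return stub.invoke(call);
  //   } catch (SessionMovedException moved) {
  //       stub = lookupStubOn(moved.currentNode); // hypothetical re-lookup
  //       return stub.invoke(call);
  //   }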
So, I am trying hard to stay very focussed on the problem domain,
otherwise this will never get finished :-)
Right, off to read those papers now - thanks for your posting and your
interest,
Jules
Hope this helps
andy
--
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."
/**********************************
* Jules Gosnell
* Partner
* Core Developers Network (Europe)
*
* www.coredevelopers.net
*
* Open Source Training & Support.
**********************************/