You can define an order to the semaphores when locking and thereby avoid
a deadlock. If each node being added or terminating itself honors the
order then you will never have a deadlock. However, you still need to
deal with the case of an uncontrolled failure while adding or removing
a node, which may leave a lock that is never released.
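Roughly, I mean something like the following (just a sketch with
made-up names, not real code from the project - the timeout is there
because of that uncontrolled-failure case):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch only: acquire the bucket-table locks of every peer involved in a
// transfer in a single agreed order (here, by node id), so that two
// concurrent join/leave operations can never hold locks in opposite orders.
public final class OrderedLocking {

    public interface Peer {
        String id();                  // e.g. node name or address
        Semaphore bucketTableLock();  // guards this peer's bucket table
    }

    public static boolean lockAll(List<Peer> peers, long timeoutMillis)
            throws InterruptedException {
        List<Peer> ordered = new ArrayList<>(peers);
        ordered.sort(Comparator.comparing(Peer::id)); // the agreed global order

        List<Semaphore> held = new ArrayList<>();
        for (Peer peer : ordered) {
            Semaphore lock = peer.bucketTableLock();
            // The timeout guards against a peer failing while holding its lock
            // and never releasing it - that case still needs separate handling.
            if (!lock.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
                held.forEach(Semaphore::release); // back out cleanly
                return false;
            }
            held.add(lock);
        }
        return true;
    }
}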
Joe
Jules Gosnell wrote:
hmm... hmmm... :-)
more thoughts on (1) and (2)...
When a node leaves/joins it needs to acquire a lease on the bucket
tables of every node that it intends to move buckets from/to. If two
nodes are doing this at the same time, their requests will collide
(deadlock) somewhere in the cluster. At that point they can be
notified and e.g. compare IP addresses to decide who continues and who
backs off for a while.
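Something like this, perhaps (purely a sketch, names made up):

// Sketch of the tie-break: when two nodes detect that their lease requests
// have collided, both apply the same deterministic rule, so exactly one
// proceeds and the other backs off and retries later.
public final class LeaseTieBreak {

    /** True if the local node should continue, false if it should back off. */
    public static boolean shouldContinue(String localAddress, String rivalAddress) {
        // Any total order works, as long as every node applies the same one;
        // here the lexicographically smaller address wins.
        return localAddress.compareTo(rivalAddress) < 0;
    }
}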
So, (1) and (2), whilst possible, are probably more complex than
I initially imagined. If we have Paxos for the more general-purpose
case (3) anyway, it would probably be smart just to go with that,
until such optimisations become necessary, if at all.
Jules
Jules Gosnell wrote:
hmmm...
now I'm wondering about my solutions to (1) and (2) - if more than
one node tries to join or leave at the same time I may be in trouble
- so it may be safer to go straight to (3) for all cases...
more thought needed :-)
Jules
Jules Gosnell wrote:
I've had a look at the Lampson paper, but didn't take it all in on
the first pass - I think it will need some serious concentration.
The Paxos algorithm looks interesting, I will definitely pursue this
avenue.
I've also given a little thought to exactly why I need a Coordinator
and how Paxos might be used to replace it. My use of a Coordinator
and plans for its future do not actually seem that far from Paxos,
on a preliminary reading.
Given that WADI currently uses a distributed map of
sessionId:sessionLocation, that this distribution is achieved by
sharing out responsibility for the fixed number of buckets that
comprise the map roughly evenly between the cluster members, and that
this is currently my most satisfying design, I can break my problem
space (for bucket arrangement) down into 3 basic cases:
1) Node joins
2) Node leaves in controlled fashion
3) Node dies
If the node under discussion is the only cluster member, then no
bucket rearrangement is necessary - this node will either create or
destroy the full set of buckets. I'll leave this set of subcases as
trivial.
1) The joining node will need to assume responsibility for a number
of buckets. If buckets-per-node is to be kept roughly the same for
every node, it is likely that the joining node will require transfer
of a small number of buckets from every current cluster member i.e.
we are starting a bucket rearrangement that will involve every
cluster member and only needs to be done if the join is successful. So,
although we wish to avoid an SPoF, if that SPoF turns out to be the
joining node, then I don't see it as a problem. If the joining node
dies, then we no longer have to worry about rearranging our buckets
(unless we have lost some that had already been transferred - see
(3)). Thus the joining node may be used as a single
Coordinator/Leader for this negotiation without fear of the SPoF
problem. Are we on the same page here ?
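Just to make the arithmetic concrete, something like this (a rough
sketch with made-up names, not actual WADI code):

import java.util.HashMap;
import java.util.Map;

// Sketch: with N existing members plus the joiner, and a fixed total bucket
// count, the joiner asks each member for roughly its surplus over the new
// fair share, so shares stay nearly even after the join.
public final class JoinPlan {

    /** Maps each existing member to the number of buckets the joiner requests from it. */
    public static Map<String, Integer> bucketsToRequest(Map<String, Integer> currentCounts,
                                                        int totalBuckets) {
        int newClusterSize = currentCounts.size() + 1;  // existing members + joiner
        int fairShare = totalBuckets / newClusterSize;  // joiner's target share

        Map<String, Integer> plan = new HashMap<>();
        int stillNeeded = fairShare;
        for (Map.Entry<String, Integer> e : currentCounts.entrySet()) {
            int surplus = Math.max(0, e.getValue() - fairShare);
            int take = Math.min(surplus, stillNeeded);
            if (take > 0) {
                plan.put(e.getKey(), take);
                stillNeeded -= take;
            }
        }
        return plan; // any small remainder can be levelled out on the next join/leave
    }
}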
2) The same argument may be applied in reverse to a node leaving in
a controlled fashion. It will wish to evacuate its buckets roughly
equally to all remaining cluster members. If it shuts down cleanly,
this would form part of its shutdown protocol. If it dies before or
during the execution of this protocol then we are back at (3), if
not, then the SPoF issue may again be put to one side.
3) This is where things get tricky :-) Currently WADI has, for the
sake of simplicity, one single algorithm / thread / point-of-failure
which recalculates a complete bucket arrangement if it detects (1),
(2) or (3). It would be simple enough to offload the work done for
(1) and (2) to the node joining/leaving and this should reduce
wadi's current vulnerability, but we still need to deal with
catastrophic failure. Currently WADI rebuilds the missing buckets by
querying the cluster for the locations of any sessions that fall
within them, but it could equally carry a replicated backup and dust
it off as part of this procedure. It's just a trade-off between work
done up front and work done in exceptional circumstance... This is
the place where the Paxos algorithm may come in handy - bucket
recomposition and rearrangement. I need to give this further
thought. For the immediate future, however, I think WADI will stay
with a single Coordinator in this situation, which fails-over if
http://activecluster.codehaus.org says it should - I'm delegating
the really thorny problem to James :-). I agree with you that this
is an SPoF and that WADI's ability to recover from failure here
depends directly on how we decide if a node is alive or dead - a
very tricky thing to do.
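The rebuild itself would be something along these lines (a sketch,
names made up):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of rebuilding a lost bucket after a node dies: the new owner asks
// every surviving peer which of its local sessions hash into the missing
// bucket, and repopulates that slice of the index from the answers.
final class BucketRebuilder {

    interface Peer {
        /** Ids of the sessions held locally by this peer that fall into the given bucket. */
        List<String> sessionsInBucket(int bucketIndex);
        String address();
    }

    Map<String, String> rebuild(int bucketIndex, List<Peer> survivors) {
        Map<String, String> index = new HashMap<>(); // sessionId -> location
        for (Peer peer : survivors) {
            for (String sessionId : peer.sessionsInBucket(bucketIndex)) {
                index.put(sessionId, peer.address());
            }
        }
        return index;
    }
}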
In conclusion then, I think that we have usefully identified a
weakness that will become more relevant as the rest of WADI's
features mature. The Lampson paper mentioned describes an algorithm
for allowing nodes to reach a consensus on actions to be performed,
in a redundant manner with no SPoF, and I shall consider how this
might replace WADI's current single Coordinator, whilst also
looking at performing other coordination on joining/leaving nodes
where its failure, coinciding with that of its host node, will be
irrelevant, since the very condition that it was intended to resolve
has ceased to exist.
How does that sound, Andy ? Do you agree with my thoughts on (1) &
(2) ? This is great input - thanks,
Jules
Jules Gosnell wrote:
Andy Piper wrote:
Hi Jules
At 05:37 AM 7/27/2005, Jules Gosnell wrote:
I agree on the SPoF thing - but I think you misunderstand my
Coordinator arch. I do not have a single static Coordinator node,
but a dynamic Coordinator role, into which a node may be elected.
Thus every node is a potential Coordinator. If the elected
Coordinator dies, another is immediately elected. The election
strategy is pluggable, although it will probably end up being
hardwired to "oldest-cluster-member". The reason behind this is
that re-laying out your cluster is much simpler if it is done in a
single vm. I originally tried to do it in multiple vms, each
taking responsibility for pieces of the cluster, but if the vms'
views are not completely in sync, things get very hairy, and
completely in sync is an expensive thing to achieve - and would
introduce a cluster-wide single point of contention. So I do it
in a single vm, as fast as I can, with failover, in case that vm
evaporates. Does that sound better than the scenario that you had
in mind ?
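Roughly, I imagine the pluggable strategy looking something like this
(just a sketch, names made up, not WADI's actual API):

import java.util.Comparator;
import java.util.List;

// Sketch of a pluggable election strategy: every node runs the same
// deterministic rule over the membership it sees, so if the elected
// Coordinator dies, a replacement is chosen the same way.
interface ElectionStrategy {
    Member elect(List<Member> members);
}

// The "oldest-cluster-member" strategy: the node that joined earliest wins.
final class OldestMemberElection implements ElectionStrategy {
    @Override
    public Member elect(List<Member> members) {
        return members.stream()
                .min(Comparator.comparingLong(Member::joinedAt))
                .orElseThrow(() -> new IllegalStateException("empty membership"));
    }
}

interface Member {
    String name();
    long joinedAt(); // e.g. join timestamp, or a monotonically increasing join order
}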
This is exactly the "hard" computer science problem that you
shouldn't be trying to solve if at all possible. It's hard because
network partitions or hung processes (think GC) make it very easy
for your colleagues to think you are dead when you do not share
that view. The result is two processes that both think they are the
coordinator, and anarchy can ensue (commonly called split-brain
syndrome). I can point you at papers if you want, but I really
suggest that you aim for an implementation that is independent of
a central coordinator. Note that a central coordinator is
necessary if you want to implement a strongly-consistent in-memory
database, but this is not usually a requirement for session
replication say.
http://research.microsoft.com/Lampson/58-Consensus/Abstract.html
gives a good introduction to some of these things. I also
presented at JavaOne on related issues, you should be able to
download the presentation from dev2dev.bea.com at some point (not
there yet - I just checked).
OK - I will have a look at these papers and reconsider... perhaps I
can come up with some sort of fractal algorithm which recursively
breaks down the cluster into subclusters, each of which is capable
of doing likewise to itself, and then lay out the buckets
recursively via this metaphor... - this would be much more robust,
as you point out, but, I think, a more complicated architecture. I
will give it some serious thought. Have you any suggestions/papers
as to how you might do something like this in a distributed manner,
bearing in mind that as a node joins, some existing nodes will see
it as having joined and some will not yet have noticed and
vice-versa on leaving....
The Coordinator is not there to support session replication, but
rather the management of the distributed map (a few buckets of which
live on each node), which is used by WADI to discover very
efficiently whether a session exists and where it is located.
This map must be rearranged, in the most efficient way possible,
each time a node joins or leaves the cluster.
Understood. Once you have a fault-tolerant singleton coordinator
you can solve lots of interesting problems, it's just hard and
often not worth the effort or the expense (typical implementations
involve HA HW or an HA DB or at least 3 server processes).
Since I am only currently using the singleton coordinator for
bucket arrangement, I may just live with it for the moment, in
order to move forward, but make a note to replace it and start
background threads on how that might be achieved...
Replication is NYI - but I'm running a few mental background
threads that suggest that an extension to the index will mean
that it associates the session's id not just with its current
location, but also with the locations of a number of replicants. I
also have ideas on how a session might choose nodes into which it
will place its replicants and how I can avoid the primary session
copy ever being colocated with a replicant (potential SPoF - if
you only have one replicant), etc...
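Roughly, the extended index entry might look something like this (just
a sketch, names made up, not actual WADI code):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: instead of mapping a session id to a single location, each entry
// also carries the nodes holding replicant copies, so a lookup can fail over
// if the primary is gone.
final class SessionLocation {
    final String primaryNode;          // node holding the live session
    final List<String> replicantNodes; // nodes holding backup copies

    SessionLocation(String primaryNode, List<String> replicantNodes) {
        if (replicantNodes.contains(primaryNode)) {
            throw new IllegalArgumentException("primary must not be colocated with a replicant");
        }
        this.primaryNode = primaryNode;
        this.replicantNodes = List.copyOf(replicantNodes);
    }
}

// One bucket's slice of the distributed map.
final class Bucket {
    final Map<String, SessionLocation> index = new ConcurrentHashMap<>();
}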
Right definitely something you want to avoid.
Yes, I can see that happening - I have an improvement (NYI) to
WADI's evacuation strategy (how sessions are evacuated when a
node wishes to leave). Each session will be evacuated to the node
which owns the bucket into which its id hashes. This is because
colocation of the session with the bucket allows many messages
concerned with its future destruction and relocation to be
optimised away. Future requests falling elsewhere but needing
this session should, in the most efficient case, be relocated to
this same node; otherwise the session may be relocated, but at a
cost...
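The routing itself is simple enough - something like (a sketch, names
made up):

import java.util.Map;

// Sketch: a session id always hashes to the same bucket, and whichever node
// currently owns that bucket is where the session should be evacuated to
// (and looked up from).
final class BucketRouter {
    private final int bucketCount;              // fixed for the life of the cluster
    private final Map<Integer, String> owners;  // bucket index -> owning node

    BucketRouter(int bucketCount, Map<Integer, String> owners) {
        this.bucketCount = bucketCount;
        this.owners = owners;
    }

    int bucketFor(String sessionId) {
        return Math.floorMod(sessionId.hashCode(), bucketCount);
    }

    String ownerFor(String sessionId) {
        return owners.get(bucketFor(sessionId));
    }
}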
How do you relocate the request? Many HW load-balancers do not
support this (or else it requires using proprietary APIs), so you
probably have to count on moving sessions in the normal failover case.
If I can squeeze the behaviour that I require out of the
load-balancer, then, depending on the request type, I may be able to
get away with a redirection with a changed session cookie or url
param, or, failing this, an http-proxy across from a filter above
the servlet on one side to the http-port on the node that owns the
session...
The LB-integration object is pluggable and the aim is to supply
wadi with a good selection of LB integrations - currently I only
have a ModJK[2] plugin working. This is able to 'restick' clients
to their session's new location (although messing with the session
id is a little dodgy...).
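The redirect-and-restick path might look roughly like this (a sketch
only - the locator and route names are made up, and the http-proxy
fallback is not shown):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch: if the session lives on another node, rewrite the routing suffix of
// the session cookie (the part mod_jk uses to pick a worker) and redirect, so
// the load-balancer sends the client to the owning node from then on.
public class RestickFilter implements Filter {
    private final SessionLocator locator = new SessionLocator(); // stands in for WADI's real lookup

    public void init(FilterConfig config) { }
    public void destroy() { }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        String sessionId = request.getRequestedSessionId();
        if (sessionId != null && !locator.isLocal(sessionId)) {
            String route = locator.routeForOwner(sessionId); // e.g. "node2"
            String bareId = sessionId.split("\\.")[0];       // strip the old ".node1" suffix
            Cookie cookie = new Cookie("JSESSIONID", bareId + "." + route);
            cookie.setPath("/");
            response.addCookie(cookie);
            response.sendRedirect(request.getRequestURI());  // client retries, now resticked
            return;
        }
        chain.doFilter(req, res);
    }
}

// Placeholder for the real session-location lookup; illustrative only.
class SessionLocator {
    boolean isLocal(String sessionId) { return true; }
    String routeForOwner(String sessionId) { return "node2"; }
}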
I would be very grateful for any thoughts or feedback that you
could give me. I hope to get much more information about WADI
into the wiki over the next few weeks. That should help generate
more discussion, although I would be more than happy for people
to ask me questions here on Geronimo-dev because this will give
me an idea of what documentation I should write and how existing
documentation may be lacking or misleading.
I guess my general comment would be that you might find it better
to think specifically about the end-user problem you are trying to
solve (say session replication) and work towards a solution based
on that. Most short-cuts / optimizations that vendors make are
specific to the problem domain and do not generally apply to all
clustering problems.
The end problem is really clustered web and ejb sessions at the
moment, although it looks as if by the time we have solved these
issues we may well have written a fault-tolerant
distributed/partitioned index that might be very useful as a
generic distributed cache building block.
One thing that I do want wadi to do is to still work when
replication is switched off - i.e., if a session only exists as a
primary copy, even if affinity breaks down, wadi will continue to
correctly render requests for that session unless some form of
catastrophic failure causes the session to evaporate. This means
that I need to ensure the session's timely evacuation from a node
that chooses to leave the cluster to a remaining node, so that it
may remain active beyond the lifetime of its original node. All of
this must work flawlessly under stress, so that an admin may add
nodes to or remove nodes from a running cluster without having to
worry about the user state that it is managing. Nodes are added by simply starting
them, and nodes removed via e.g. ctl-c-ing them.
If it is decided that a few more nines are needed in terms of
session availability, and the cluster owner understands the extra
cost of in-vm replication in terms of the extra hardware and
bandwidth that they will have to purchase and is happy to go with
it, then it should be sufficient to up the number of replicated
copies kept by the cluster from '0' to e.g. '2' and restart. (It
might even be possible to vary this setting on a node-to-node basis,
so that this change does not even involve a complete cluster cold
start.) WADI should deal with the rest.
So, I believe that I have a pretty clear idea of what WADI will do,
and aside from the replication stuff (phase2) it currently does
most of what I had in mind for phase1, except that it is not yet
happy under stress. I figure it will probably take one or two more
redesign/reimplementation iterations to get it to this stage, then
I can consider replication.
I have spoken to members of the OpenEJB team about wadi's ability
to relocate requests as well as sessions and we came to the
conclusion that it was just as applicable in the EJB world as the
web world. If the node an ejb client is talking to leaves the
cluster in between calls, the client may try to contact it and then
fail over to another node that it hopes holds the session. If, due
to other nodes leaving/joining, it is not always clear which node
will contain the session, the ability to reply to an RMI and just
say "not here - there!" - i.e. an rmi redirection - would not be
hard to add and would resolve this situation. Transactions are
another item which I have marked phase2.
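Roughly, I imagine something like this (a sketch only, not an OpenEJB
API - all names made up):

import java.io.Serializable;

// Sketch of the "not here - there!" reply: instead of failing, a node that no
// longer holds the session answers with the address it believes now owns it,
// and the client simply retries there.
class RelocationException extends Exception {
    final String newNodeAddress;
    RelocationException(String newNodeAddress) { this.newNodeAddress = newNodeAddress; }
}

interface SessionInvoker {
    Serializable invoke(String node, String sessionId, Serializable call) throws Exception;
}

final class RedirectingClient {
    Serializable call(SessionInvoker invoker, String node, String sessionId, Serializable body)
            throws Exception {
        try {
            return invoker.invoke(node, sessionId, body);
        } catch (RelocationException moved) {
            // follow the redirection once; a real client would bound the retries
            return invoker.invoke(moved.newNodeAddress, sessionId, body);
        }
    }
}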
So, I am trying hard to stay very focussed on the problem domain,
otherwise this will never get finished :-)
Right, off to read those papers now - thanks for your posting and
your interest,
Jules
Hope this helps
andy
--
Joe Bohn
[EMAIL PROTECTED]
"He is no fool who gives what he cannot keep, to gain what he cannot lose." -- Jim Elliot