I've had a look at the Lampson paper, but didn't take it all in on the
first pass - I think it will need some serious concentration. The Paxos
algorithm looks interesting, I will definitely pursue this avenue.
I've also given a little thought to exactly why I need a Coordinator and
how Paxos might be used to replace it. My use of a Coordinator and plans
for its future do not actually seem that far from Paxos, on a
preliminary reading.
Given that WADI currently uses a distributed map of
sessionId:sessionLocation, that this distribution is achieved by sharing
out responsibility for the fixed number of buckets that comprise the map
roughly evenly between the cluster members, and that this is currently my
most satisfying design, I can break my problem space (for bucket
arrangement) down into 3 basic cases:
1) Node joins
2) Node leaves in controlled fashion
3) Node dies
If the node under discussion is the only cluster member, then no bucket
rearrangement is necessary - this node will simply create or destroy the
full set of buckets. I'll set these subcases aside as trivial.
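To make the map concrete, here is a minimal sketch of the bucket-indexed
location table described above. All names (BucketTable etc.) are invented
for illustration - this is not WADI's API:

  import java.util.HashMap;
  import java.util.Map;

  public class BucketTable {
      private final int numBuckets;              // fixed at configuration time
      private final Map<Integer, String> owners; // bucket index -> owning node

      public BucketTable(int numBuckets) {
          this.numBuckets = numBuckets;
          this.owners = new HashMap<Integer, String>();
      }

      // the bucket an id hashes into never changes; only a bucket's owner moves
      public int bucketFor(String sessionId) {
          return Math.abs(sessionId.hashCode() % numBuckets);
      }

      public String ownerOf(String sessionId) {
          return owners.get(bucketFor(sessionId));
      }

      public void assign(int bucket, String node) {
          owners.put(bucket, node);
      }
  }

The point of the shape is that "rearrangement" only ever means reassigning
entries in 'owners'; session ids never move between buckets.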
1) The joining node will need to assume responsibility for a number of
buckets. If buckets-per-node is to be kept roughly the same for every
node, it is likely that the joining node will require transfer of a
small number of buckets from every current cluster member i.e. we are
starting a bucket rearrangement that will involve every cluster member
and only need be done if the join is successful. So, although we wish to
avoid an SPoF, if that SPoF turns out to be the joining node, then I
don't see it as a problem. If the joining node dies, then we no longer
have to worry about rearranging our buckets (unless we have lost some
that had already been transferred - see (3)). Thus the joining node may
be used as a single Coordinator/Leader for this negotiation without fear
of the SPoF problem. Are we on the same page here?
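For illustration, the arithmetic of case (1) might look like the following
sketch, in which the joiner works out how many buckets to request from each
current member. Everything here is hypothetical - none of it is WADI code:

  import java.util.LinkedHashMap;
  import java.util.Map;

  public class JoinPlan {
      // owned: current member -> number of buckets it owns (sums to numBuckets)
      public static Map<String, Integer> transfersToJoiner(Map<String, Integer> owned,
                                                           int numBuckets) {
          int fairShare = numBuckets / (owned.size() + 1); // the joiner's target
          int needed = fairShare;
          Map<String, Integer> plan = new LinkedHashMap<String, Integer>();
          for (Map.Entry<String, Integer> e : owned.entrySet()) {
              if (needed == 0) break;
              int surplus = Math.max(e.getValue() - fairShare, 0);
              int take = Math.min(surplus, needed);
              if (take > 0) {
                  plan.put(e.getKey(), take);
                  needed -= take;
              }
          }
          // e.g. {a=342, b=341, c=341} over 1024 buckets -> {a=86, b=85, c=85}
          return plan;
      }
  }

Since only the joiner runs this, its death simply aborts the plan - which is
exactly the property argued for above.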
2) The same argument may be applied in reverse to a node leaving in a
controlled fashion. It will wish to evacuate its buckets roughly equally
to all remaining cluster members. If it shuts down cleanly, this would
form part of its shutdown protocol. If it dies before or during the
execution of this protocol then we are back at (3); if not, the
SPoF issue may again be put to one side.
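Case (2) is then the mirror image - a sketch of the shutdown-protocol side,
again with invented names:

  import java.util.List;

  public class LeavePlan {
      // deal my buckets out, round-robin, to the members that will remain
      public static void evacuate(List<Integer> myBuckets, List<String> remaining) {
          for (int i = 0; i < myBuckets.size(); i++) {
              String target = remaining.get(i % remaining.size());
              // a real implementation would transfer the bucket's
              // sessionId:sessionLocation entries to 'target' here
              System.out.println("bucket " + myBuckets.get(i) + " -> " + target);
          }
      }
  }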
3) This is where things get tricky :-) Currently WADI has, for the sake
of simplicity, one single algorithm / thread / point-of-failure which
recalculates a complete bucket arrangement if it detects (1), (2) or
(3). It would be simple enough to offload the work done for (1) and (2)
to the node joining/leaving and this should reduce wadi's current
vulnerability, but we still need to deal with catastrophic failure.
Currently WADI rebuilds the missing buckets by querying the cluster for
the locations of any sessions that fall within them, but it could
equally carry a replicated backup and dust it off as part of this
procedure (the query-based rebuild is sketched below). It's just a
trade-off between work done up front and work done in exceptional
circumstances... This is the place where the Paxos algorithm may come in
handy - bucket recomposition and rearrangement. I need to give this
further thought. For the immediate future, however, I think WADI will
stay with a single Coordinator in this situation, which fails over if
http://activecluster.codehaus.org says it should - I'm delegating the
really thorny problem to James :-). I agree with you that this is an
SPoF and that WADI's ability to recover from failure here depends
directly on how we decide whether a node is alive or dead - a very
tricky thing to do.
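By way of illustration, the query-based rebuild might be sketched like this.
The Peer interface is a stand-in, not a WADI or ActiveCluster API:

  import java.util.Collection;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.Set;

  interface Peer {
      String name();
      // ids of locally held sessions that hash into any of the given buckets
      Collection<String> sessionsInBuckets(Set<Integer> buckets);
  }

  public class BucketRebuilder {
      // ask every survivor about the dead node's buckets, repopulate the index
      public static Map<String, String> rebuild(Set<Integer> lostBuckets,
                                                Collection<Peer> survivors) {
          Map<String, String> index = new HashMap<String, String>();
          for (Peer p : survivors) {
              for (String sessionId : p.sessionsInBuckets(lostBuckets)) {
                  index.put(sessionId, p.name()); // re-learn id -> location
              }
          }
          return index;
      }
  }

The replicated-backup alternative would replace the survivor queries with a
read of the backup copy - the trade-off is visible in how much of this work
happens at failure time.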
In conclusion then, I think that we have usefully identified a weakness
that will become more relevant as the rest of WADI's features mature.
The Lampson paper mentioned above describes an algorithm that allows
nodes to reach consensus on actions to be performed, redundantly and
with no SPoF. I shall consider how this might replace WADI's current
single Coordinator, whilst also looking at moving other coordination
onto joining/leaving nodes, where the coordinator's failure, coinciding
with that of its host node, will be irrelevant, since the very condition
it was intended to resolve has ceased to exist.
How does that sound, Andy? Do you agree with my thoughts on (1) & (2)?
This is great input - thanks,
Jules
Jules Gosnell wrote:
Andy Piper wrote:
Hi Jules
At 05:37 AM 7/27/2005, Jules Gosnell wrote:
I agree on the SPoF thing - but I think you misunderstand my
Coordinator arch. I do not have a single static Coordinator node,
but a dynamic Coordinator role, into which a node may be elected.
Thus every node is a potential Coordinator. If the elected
Coordinator dies, another is immediately elected. The election
strategy is pluggable, although it will probably end up being
hardwired to "oldest-cluster-member". The reason behind this is that
relaying out your cluster is much simpler if it is done in a single
vm. I originally tried to do it in multiple vms, each taking
responsibility for pieces of the cluster, but if the vms views are
not completely in sync, things get very hairy, and completely in
sync is an expensive thing to achieve - and would introduce a
cluster-wide single point of contention. So I do it in a single vm,
as fast as I can, with fail over, in case that vm evaporates. Does
that sound better than the scenario that you had in mind ?
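For concreteness, "oldest-cluster-member" is attractive because every node
can compute the same winner locally - a minimal sketch, with NodeInfo
standing in for whatever the real membership API exposes:

  import java.util.Collection;

  interface NodeInfo {
      String id();
      long joinedAt(); // e.g. millis since epoch at cluster join
  }

  public class OldestMemberElection {
      public static NodeInfo elect(Collection<NodeInfo> members) {
          NodeInfo oldest = null;
          for (NodeInfo n : members) {
              // ties on join time break on id, so every node agrees
              if (oldest == null
                      || n.joinedAt() < oldest.joinedAt()
                      || (n.joinedAt() == oldest.joinedAt()
                          && n.id().compareTo(oldest.id()) < 0)) {
                  oldest = n;
              }
          }
          return oldest;
      }
  }

Determinism only helps, of course, if every node sees the same membership -
which is precisely the assumption questioned below.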
This is exactly the "hard" computer science problem that you
shouldn't be trying to solve if at all possible. Its hard because
network partitions or hung processes (think GC) make it very easy for
your colleagues to think you are dead when you do not share that
view. The result is two processes who think they are the coordinator
and anarchy can ensue (commonly called split-brain syndrome). I can
point you at papers if you want, but I really suggest that you aim
for an implementation that is independent of a central coordinator.
Note that a central coordinator is necessary if you want to implement
a strongly-consistent in-memory database, but this is not usually a
requirement for session replication say.
http://research.microsoft.com/Lampson/58-Consensus/Abstract.html
gives a good introduction to some of these things. I also presented
at JavaOne on related issues, you should be able to download the
presentation from dev2dev.bea.com at some point (not there yet - I
just checked).
OK - I will have a look at these papers and reconsider... perhaps I
can come up with some sort of fractal algorithm which recursively
breaks the cluster down into subclusters, each of which is capable of
doing likewise to itself, and then lay out the buckets recursively via
this metaphor... - this would be much more robust, as you point out,
but, I think, a more complicated architecture. I will give it some
serious thought. Have you any suggestions/papers as to how you might
do something like this in a distributed manner, bearing in mind that,
as a node joins, some existing nodes will see it as having joined and
some will not yet have noticed, and vice-versa on leaving...?
The Coordinator is not there to support session replication, but
rather the management of the distributed map (a few of whose buckets
live on each node) which is used by WADI to discover very
efficiently whether a session exists and where it is located. This
map must be rearranged, in the most efficient way possible, each
time a node joins or leaves the cluster.
Understood. Once you have a fault-tolerant singleton coordinator you
can solve lots of interesting problems; it's just hard and often not
worth the effort or the expense (typical implementations involve HA
HW or an HA DB or at least 3 server processes).
Since I am only currently using the singleton coordinator for bucket
arrangement, I may just live with it for the moment, in order to move
forward, but make a note to replace it and start background threads on
how that might be achieved...
Replication is NYI (not yet implemented) - but I'm running a few
mental background threads that suggest that an extension to the index
will mean that it associates the session's id not just with its
current location, but also with the locations of a number of
replicants. I also have ideas on how a session might choose the nodes
into which it will place its replicants and how I can avoid the
primary session copy ever being colocated with a replicant (a
potential SPoF - if you only have one replicant), etc...
Right definitely something you want to avoid.
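As a sketch of that extended index entry - purely illustrative, no such
structure exists in WADI yet:

  import java.util.Arrays;
  import java.util.List;

  public class LocationEntry {
      final String primary;          // node holding the live session copy
      final List<String> replicants; // backup locations; must exclude primary

      LocationEntry(String primary, String... replicants) {
          if (Arrays.asList(replicants).contains(primary)) {
              // guards the rule discussed above: primary and replicant must
              // never be colocated, or one node death can lose both copies
              throw new IllegalArgumentException("replicant colocated with primary");
          }
          this.primary = primary;
          this.replicants = Arrays.asList(replicants);
      }
  }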
Yes, I can see that happening - I have an improvement (NYI) to
WADI's evacuation strategy (how sessions are evacuated when a node
wishes to leave). Each session will be evacuated to the node which
owns the bucket into which its id hashes. This is because colocation
of the session with the bucket allows many messages concerned with
its future destruction and relocation to be optimised away. Future
requests falling elsewhere but needing this session should, in the
most efficient case, be relocated to this same node; otherwise the
session may be relocated, but at a cost...
How do you relocate the request? Many HW load-balancers do not
support this (or else it requires using proprietary APIs), so you
probably have to count on
moving sessions in the normal failover case.
If I can squeeze the behaviour that I require out of the
load-balancer then, depending on the request type, I may be able to
get away with a redirection with a changed session cookie or url
param, or, failing this, an http-proxy hop from a filter above the
servlet on one side to the http port of the node that owns the session...
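A minimal sketch of that filter-level redirection, using only the servlet
API - the owner lookup and URL construction are invented, and the proxy
fallback is elided:

  import java.io.IOException;
  import javax.servlet.*;
  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletResponse;

  public class RelocationFilter implements Filter {
      public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
              throws IOException, ServletException {
          HttpServletRequest hreq = (HttpServletRequest) req;
          String owner = lookupOwner(hreq.getRequestedSessionId()); // via bucket index
          if (owner == null || isLocal(owner)) {
              chain.doFilter(req, res); // session is here (or unknown): carry on
          } else {
              // cheap path: bounce the client at the owning node; the expensive
              // fallback would proxy the request across to owner's http port
              ((HttpServletResponse) res).sendRedirect(
                  "http://" + owner + hreq.getRequestURI());
          }
      }
      public void init(FilterConfig config) {}
      public void destroy() {}

      private String lookupOwner(String sessionId) { return null; } // placeholder
      private boolean isLocal(String owner) { return true; }        // placeholder
  }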
The LB-integration object is pluggable and the aim is to supply wadi
with a good selection of LB integrations - currently I only have a
ModJK[2] plugin working. This is able to 'restick' clients to their
session's new location (although messing with the session id is a
little dodgy...).
I would be very grateful for any thoughts or feedback that you could
give me. I hope to get much more information about WADI into the
wiki over the next few weeks. That should help generate more
discussion, although I would be more than happy for people to ask me
questions here on Geronimo-dev because this will give me an idea of
what documentation I should write and how existing documentation may
be lacking or misleading.
I guess my general comment would be that you might find it better to
think specifically about the end-user problem you are trying to solve
(say session replication) and work towards a solution based on that.
Most short-cuts / optimizations that vendors make are specific to the
problem domain and do not generally apply to all clustering problems.
The end problem is really clustered web and ejb sessions at the
moment, although it looks as if by the time we have solved these
issues we may well have written a fault-tolerant
distributed/partitioned index that might be very useful as a generic
distributed cache building block.
One thing that I do want wadi to do is to keep working when replication
is switched off - i.e., if a session only exists as a primary copy,
even if affinity breaks down, wadi will continue to correctly render
requests for that session unless some form of catastrophic failure
requests for that session unless some form of catastrophic failure
causes the session to evaporate. This means that I need to ensure the
session's timely evacuation from a node that chooses to leave the
cluster to a remaining node, so that it may remain active beyond the
lifetime of its original node. All of this must work flawlessly under
stress, so that an admin may add or remove nodes to a running cluster
without having to worry about the user state that it is managing.
Nodes are added by simply starting them, and nodes removed via e.g.
ctrl-C-ing them.
If it is decided that a few more nines are needed in terms of session
availability, and the cluster owner understands the extra cost of
in-vm replication - the extra hardware and bandwidth that they will
have to purchase - and is happy to go with it, then it should be
sufficient to up the number of replicated copies kept by the cluster
from '0' to e.g. '2' and restart (it might even be possible to vary
this setting on a node-to-node basis, so that this change does not
even involve a complete cluster cold start). WADI should deal with
the rest.
So, I believe that I have a pretty clear idea of what WADI will do,
and aside from the replication stuff (phase2) it currently does most
of what I had in mind for phase1, except that it is not yet happy
under stress. I figure it will probably take one or two more
redesign/reimplementation iterations to get it to this stage, then I
can consider replication.
I have spoken to members of the OpenEJB team about wadi's ability to
relocate requests as well as sessions and we came to the conclusion
that it was just as applicable in the EJB world as the web world. If
the node an ejb client is talking to leaves the cluster in between
calls, the client may try to contact it and then failover to another
node that it hopes holds the session. If, due to other nodes
leaving/joining, it is not always clear which node will contain the
session, then the ability to reply to an RMI and just say "not here -
there!" - i.e. an rmi redirection - would not be hard to add and would
resolve this situation. Transactions are another item which I have
marked phase2.
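To make the rmi-redirection idea concrete - a sketch only; none of this is
OpenEJB or WADI API:

  public class SessionMovedException extends Exception {
      public final String currentNode; // where the session actually lives now

      public SessionMovedException(String currentNode) {
          super("not here - there: " + currentNode);
          this.currentNode = currentNode;
      }
  }

  // on the client, one extra catch turns relocation into a transparent retry:
  //
  //   try {
  //       return stub.invoke(call);
  //   } catch (SessionMovedException moved) {
  //       stub = lookupStubOn(moved.currentNode); // hypothetical re-lookup
  //       return stub.invoke(call);
  //   }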
So, I am trying hard to stay very focussed on the problem domain,
otherwise this will never get finished :-)
Right, off to read those papers now - thanks for your posting and your
interest,
Jules
Hope this helps
andy
--
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."
/**********************************
* Jules Gosnell
* Partner
* Core Developers Network (Europe)
*
* www.coredevelopers.net
*
* Open Source Training & Support.
**********************************/