You can define an order to the semaphores when locking and thereby avoid
a deadlock. If each node being added or terminating itself honors the
order then you will never have a deadlock. However, you still need to
deal with the case of an uncontrolled failure while adding or removing
a node, which may leave a lock that is never released.
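Roughly, I mean something like the following (just a sketch with
made-up names, not real code from the project - the timeout is there
because of that uncontrolled-failure case):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch only: acquire the bucket-table locks of every peer involved in a
// transfer in a single agreed order (here, by node id), so that two
// concurrent join/leave operations can never hold locks in opposite orders.
public final class OrderedLocking {

    public interface Peer {
        String id();                  // e.g. node name or address
        Semaphore bucketTableLock();  // guards this peer's bucket table
    }

    public static boolean lockAll(List<Peer> peers, long timeoutMillis)
            throws InterruptedException {
        List<Peer> ordered = new ArrayList<>(peers);
        ordered.sort(Comparator.comparing(Peer::id)); // the agreed global order

        List<Semaphore> held = new ArrayList<>();
        for (Peer peer : ordered) {
            Semaphore lock = peer.bucketTableLock();
            // The timeout guards against a peer failing while holding its lock
            // and never releasing it - that case still needs separate handling.
            if (!lock.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
                held.forEach(Semaphore::release); // back out cleanly
                return false;
            }
            held.add(lock);
        }
        return true;
    }
}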
Joe
Jules Gosnell wrote:
hmm... hmmm... :-)
more thoughts on (1) and (2)...
When a node leaves/joins it needs to acquire a lease on the bucket
tables of every node that it intends to move buckets from/to. If two
nodes are doing this at the same time, their requests will collide
(deadlock) somewhere in the cluster. At that point they can be
notified and e.g. compare IP addresses to decide who continues and who
backs off for a while.
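Something like this, perhaps (purely a sketch, names made up):

// Sketch of the tie-break: when two nodes detect that their lease requests
// have collided, both apply the same deterministic rule, so exactly one
// proceeds and the other backs off and retries later.
public final class LeaseTieBreak {

    /** True if the local node should continue, false if it should back off. */
    public static boolean shouldContinue(String localAddress, String rivalAddress) {
        // Any total order works, as long as every node applies the same one;
        // here the lexicographically smaller address wins.
        return localAddress.compareTo(rivalAddress) < 0;
    }
}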
So, (1) and (2), whilst possible, are probably more complex than
I initially imagined. If we have Paxos for the more general-purpose
case (3) anyway, it would probably be smart just to go with that,
until such optimisations become necessary, if at all.
Jules
Jules Gosnell wrote:
hmmm...
now I'm wondering about my solutions to (1) and (2) - if more than
one node tries to join or leave at the same time I may be in trouble
- so it may be safer to go straight to (3) for all cases...
more thought needed :-)
Jules
Jules Gosnell wrote:
I've had a look at the Lampson paper, but didn't take it all in on
the first pass - I think it will need some serious concentration.
The Paxos algorithm looks interesting, I will definitely pursue this
avenue.
I've also given a little thought to exactly why I need a Coordinator
and how Paxos might be used to replace it. My use of a Coordinator
and plans for its future do not actually seem that far from Paxos,
on a preliminary reading.
Given that WADI currently uses a distributed map of
sessionId:sessionLocation, that this distribution is achieved by
sharing out responsibility for the fixed number of buckets that
comprise the map roughly evenly between the cluster members, and that
this is currently my most satisfying design, I can break my problem
space (for bucket arrangement) down into 3 basic cases:
1) Node joins
2) Node leaves in controlled fashion
3) Node dies
If the node under discussion is the only cluster member, then no
bucket rearrangement is necessary - this node will either create or
destroy the full set of buckets. I'll leave this set of subcases as
trivial.
1) The joining node will need to assume responsibility for a number
of buckets. If buckets-per-node is to be kept roughly the same for
every node, it is likely that the joining node will require transfer
of a small number of buckets from every current cluster member i.e.
we are starting a bucket rearrangement that will involve every
cluster member and only needs to be done if the join is successful. So,
although we wish to avoid an SPoF, if that SPoF turns out to be the
joining node, then I don't see it as a problem. If the joining node
dies, then we no longer have to worry about rearranging our buckets
(unless we have lost some that had already been transferred - see
(3)). Thus the joining node may be used as a single
Coordinator/Leader for this negotiation without fear of the SPoF
problem. Are we on the same page here ?
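Just to make the arithmetic concrete, something like this (a rough
sketch with made-up names, not actual WADI code):

import java.util.HashMap;
import java.util.Map;

// Sketch: with N existing members plus the joiner, and a fixed total bucket
// count, the joiner asks each member for roughly its surplus over the new
// fair share, so shares stay nearly even after the join.
public final class JoinPlan {

    /** Maps each existing member to the number of buckets the joiner requests from it. */
    public static Map<String, Integer> bucketsToRequest(Map<String, Integer> currentCounts,
                                                        int totalBuckets) {
        int newClusterSize = currentCounts.size() + 1;  // existing members + joiner
        int fairShare = totalBuckets / newClusterSize;  // joiner's target share

        Map<String, Integer> plan = new HashMap<>();
        int stillNeeded = fairShare;
        for (Map.Entry<String, Integer> e : currentCounts.entrySet()) {
            int surplus = Math.max(0, e.getValue() - fairShare);
            int take = Math.min(surplus, stillNeeded);
            if (take > 0) {
                plan.put(e.getKey(), take);
                stillNeeded -= take;
            }
        }
        return plan; // any small remainder can be levelled out on the next join/leave
    }
}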
2) The same argument may be applied in reverse to a node leaving in
a controlled fashion. It will wish to evacuate its buckets roughly
equally to all remaining cluster members. If it shuts down cleanly,
this would form part of its shutdown protocol. If it dies before or
during the execution of this protocol then we are back at (3), if
not, then the SPoF issue may again be put to one side.
3) This is where things get tricky :-) Currently WADI has, for the
sake of simplicity, one single algorithm / thread / point-of-failure
which recalculates a complete bucket arrangement if it detects (1),
(2) or (3). It would be simple enough to offload the work done for
(1) and (2) to the node joining/leaving and this should reduce
wadi's current vulnerability, but we still need to deal with
catastrophic failure. Currently WADI rebuilds the missing buckets by
querying the cluster for the locations of any sessions that fall
within them, but it could equally carry a replicated backup and dust
it off as part of this procedure. It's just a trade-off between work
done up front and work done in exceptional circumstance... This is
the place where the Paxos algorithm may come in handy - bucket
recomposition and rearrangement. I need to give this further
thought. For the immediate future, however, I think WADI will stay
with a single Coordinator in this situation, which fails-over if
http://activecluster.codehaus.org says it should - I'm delegating
the really thorny problem to James :-). I agree with you that this
is an SPoF and that WADI's ability to recover from failure here
depends directly on how we decide if a node is alive or dead - a
very tricky thing to do.
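The rebuild itself would be something along these lines (a sketch,
names made up):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of rebuilding a lost bucket after a node dies: the new owner asks
// every surviving peer which of its local sessions hash into the missing
// bucket, and repopulates that slice of the index from the answers.
final class BucketRebuilder {

    interface Peer {
        /** Ids of the sessions held locally by this peer that fall into the given bucket. */
        List<String> sessionsInBucket(int bucketIndex);
        String address();
    }

    Map<String, String> rebuild(int bucketIndex, List<Peer> survivors) {
        Map<String, String> index = new HashMap<>(); // sessionId -> location
        for (Peer peer : survivors) {
            for (String sessionId : peer.sessionsInBucket(bucketIndex)) {
                index.put(sessionId, peer.address());
            }
        }
        return index;
    }
}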
In conclusion then, I think that we have usefully identified a
weakness that will become more relevant as the rest of WADI's
features mature. The Lampson paper mentioned describes an algorithm
for allowing nodes to reach a consensus on actions to be performed,
in a redundant manner with no SPoF, and I shall consider how this
might replace WADI's current single Coordinator, whilst also
looking at performing other coordination on joining/leaving nodes
where its failure, coinciding with that of its host node, will be
irrelevant, since the very condition that it was intended to resolve
has ceased to exist.
How does that sound, Andy ? Do you agree with my thoughts on (1) &
(2) ? This is great input - thanks,
Jules
Jules Gosnell wrote:
Andy Piper wrote:
Hi Jules
At 05:37 AM 7/27/2005, Jules Gosnell wrote:
I agree on the SPoF thing - but I think you misunderstand my
Coordinator arch. I do not have a single static Coordinator node,
but a dynamic Coordinator role, into which a node may be elected.
Thus every node is a potential Coordinator. If the elected
Coordinator dies, another is immediately elected. The election
strategy is pluggable, although it will probably end up being
hardwired to "oldest-cluster-member". The reason behind this is
that re-laying out your cluster is much simpler if it is done in a
single vm. I originally tried to do it in multiple vms, each
taking responsibility for pieces of the cluster, but if the vms'
views are not completely in sync, things get very hairy, and
completely in sync is an expensive thing to achieve - and would
introduce a cluster-wide single point of contention. So I do it
in a single vm, as fast as I can, with failover, in case that vm
evaporates. Does that sound better than the scenario that you had
in mind ?
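Roughly, I imagine the pluggable strategy looking something like this
(just a sketch, names made up, not WADI's actual API):

import java.util.Comparator;
import java.util.List;

// Sketch of a pluggable election strategy: every node runs the same
// deterministic rule over the membership it sees, so if the elected
// Coordinator dies, a replacement is chosen the same way.
interface ElectionStrategy {
    Member elect(List<Member> members);
}

// The "oldest-cluster-member" strategy: the node that joined earliest wins.
final class OldestMemberElection implements ElectionStrategy {
    @Override
    public Member elect(List<Member> members) {
        return members.stream()
                .min(Comparator.comparingLong(Member::joinedAt))
                .orElseThrow(() -> new IllegalStateException("empty membership"));
    }
}

interface Member {
    String name();
    long joinedAt(); // e.g. join timestamp, or a monotonically increasing join order
}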
This is exactly the "hard" computer science problem that you
shouldn't be trying to solve if at all possible. It's hard because
network partitions or hung processes (think GC) make it very easy
for your colleagues to think you are dead when you do not share
that view. The result is two processes that both think they are the
coordinator, and anarchy can ensue (commonly called split-brain
syndrome). I can point you at papers if you want, but I really
suggest that you aim for an implementation that is independent of
a central coordinator. Note that a central coordinator is
necessary if you want to implement a strongly-consistent in-memory
database, but this is not usually a requirement for session
replication say.
http://research.microsoft.com/Lampson/58-Consensus/Abstract.html
gives a good introduction to some of these things. I also
presented at JavaOne on related issues, you should be able to
download the presentation from dev2dev.bea.com at some point (not
there yet - I just checked).
OK - I will have a look at these papers and reconsider... perhaps I
can come up with some sort of fractal algorithm which recursively
breaks down the cluster into subclusters, each of which is capable
of doing likewise to itself, and then lay out the buckets
recursively via this metaphor... - this would be much more robust,
as you point out, but, I think, a more complicated architecture. I
will give it some serious thought. Have you any suggestions/papers
as to how you might do something like this in a distributed manner,
bearing in mind that as a node joins, some existing nodes will see
it as having joined and some will not yet have noticed and
vice-versa on leaving....
The Coordinator is not there to support session replication, but
rather the management of the distributed map (a few buckets of which
live on each node), which is used by WADI to discover very
efficiently whether a session exists and where it is located.
This map must be rearranged, in the most efficient way possible,
each time a node joins or leaves the cluster.
Understood. Once you have a fault-tolerant singleton coordinator
you can solve lots of interesting problems, it's just hard and
often not worth the effort or the expense (typical implementations
involve HA HW or an HA DB or at least 3 server processes).
Since I am only currently using the singleton coordinator for
bucket arrangement, I may just live with it for the moment, in
order to move forward, but make a note to replace it and start
background threads on how that might be achieved...
Replication is NYI - but I'm running a few mental background
threads that suggest that an extension to the index will mean
that it associates the session's id not just with its current
location, but also with the locations of a number of replicants. I
also have ideas on how a session might choose nodes into which it
will place its replicants and how I can avoid the primary session
copy ever being colocated with a replicant (potential SPoF - if
you only have one replicant), etc...
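Roughly, the extended index entry might look something like this (just
a sketch, names made up, not actual WADI code):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: instead of mapping a session id to a single location, each entry
// also carries the nodes holding replicant copies, so a lookup can fail over
// if the primary is gone.
final class SessionLocation {
    final String primaryNode;          // node holding the live session
    final List<String> replicantNodes; // nodes holding backup copies

    SessionLocation(String primaryNode, List<String> replicantNodes) {
        if (replicantNodes.contains(primaryNode)) {
            throw new IllegalArgumentException("primary must not be colocated with a replicant");
        }
        this.primaryNode = primaryNode;
        this.replicantNodes = List.copyOf(replicantNodes);
    }
}

// One bucket's slice of the distributed map.
final class Bucket {
    final Map<String, SessionLocation> index = new ConcurrentHashMap<>();
}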
Right definitely something you want to avoid.
Yes, I can see that happening - I have an improvement (NYI) to
WADI's evacuation strategy (how sessions are evacuated when a
node wishes to leave). Each session will be evacuated to the node
which owns the bucket into which its id hashes. This is because
colocation of the session with the bucket allows many messages
concerned with its future destruction and relocation to be
optimised away. Future requests falling elsewhere but needing
this session should, in the most efficient case, be relocated to
this same node; otherwise the session may be relocated, but at a
cost...
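The routing itself is simple enough - something like (a sketch, names
made up):

import java.util.Map;

// Sketch: a session id always hashes to the same bucket, and whichever node
// currently owns that bucket is where the session should be evacuated to
// (and looked up from).
final class BucketRouter {
    private final int bucketCount;              // fixed for the life of the cluster
    private final Map<Integer, String> owners;  // bucket index -> owning node

    BucketRouter(int bucketCount, Map<Integer, String> owners) {
        this.bucketCount = bucketCount;
        this.owners = owners;
    }

    int bucketFor(String sessionId) {
        return Math.floorMod(sessionId.hashCode(), bucketCount);
    }

    String ownerFor(String sessionId) {
        return owners.get(bucketFor(sessionId));
    }
}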
How do you relocate the request? Many HW load-balancers do not
support this (or else it requires using proprietary APIs), so you
probably have to count on moving sessions in the normal failover case.
If I can squeeze the behaviour that I require out of the
load-balancer, then, depending on the request type, I may be able to
get away with a redirection with a changed session cookie or url
param, or, failing this, an http-proxy across from a filter above
the servlet on one side to the http-port on the node that owns the
session...
The LB-integration object is pluggable and the aim is to supply
wadi with a good selection of LB integrations - currently I only
have a ModJK[2] plugin working. This is able to 'restick' clients
to their session's new location (although messing with the session
id is a little dodgy...).
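The redirect-and-restick path might look roughly like this (a sketch
only - the locator and route names are made up, and the http-proxy
fallback is not shown):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch: if the session lives on another node, rewrite the routing suffix of
// the session cookie (the part mod_jk uses to pick a worker) and redirect, so
// the load-balancer sends the client to the owning node from then on.
public class RestickFilter implements Filter {
    private final SessionLocator locator = new SessionLocator(); // stands in for WADI's real lookup

    public void init(FilterConfig config) { }
    public void destroy() { }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        String sessionId = request.getRequestedSessionId();
        if (sessionId != null && !locator.isLocal(sessionId)) {
            String route = locator.routeForOwner(sessionId); // e.g. "node2"
            String bareId = sessionId.split("\\.")[0];       // strip the old ".node1" suffix
            Cookie cookie = new Cookie("JSESSIONID", bareId + "." + route);
            cookie.setPath("/");
            response.addCookie(cookie);
            response.sendRedirect(request.getRequestURI());  // client retries, now resticked
            return;
        }
        chain.doFilter(req, res);
    }
}

// Placeholder for the real session-location lookup; illustrative only.
class SessionLocator {
    boolean isLocal(String sessionId) { return true; }
    String routeForOwner(String sessionId) { return "node2"; }
}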
I would be very grateful for any thoughts or feedback that you
could give me. I hope to get much more information about WADI
into the wiki over the next few weeks. That should help generate
more discussion, although I would be more than happy for people
to ask me questions here on Geronimo-dev because this will give
me an idea of what documentation I should write and how existing
documentation may be lacking or misleading.
I guess my general comment would be that you might find it better
to think specifically about the end-user problem you are trying to
solve (say session replication) and work towards a solution based
on that. Most short-cuts / optimizations that vendors make are
specific to the problem domain and do not generally apply to all
clustering problems.
The end problem is really clustered web and ejb sessions at the
moment, although it looks as if by the time we have solved these
issues we may well have written a fault-tolerant
distributed/partitioned index that might be very useful as a
generic distributed cache building block.
One thing that I do want wadi to do is to still work when
replication is switched off - i.e., if a session only exists as a
primary copy, even if affinity breaks down, wadi will continue to
correctly render requests for that session unless some form of
catastrophic failure causes the session to evaporate. This means
that I need to ensure the session's timely evacuation from a node
that chooses to leave the cluster to a remaining node, so that it
may remain active beyond the lifetime of its original node. All of
this must work flawlessly under stress, so that an admin may add
nodes to or remove nodes from a running cluster without having to
worry about the user state that it is managing. Nodes are added by simply starting
them, and nodes removed via e.g. ctl-c-ing them.
If it is decided that a few more nines are needed in terms of
session availability, and the cluster owner understands the extra
cost of in-vm replication in terms of the extra hardware and
bandwidth that they will have to purchase and is happy to go with
it, then it should be sufficient to up the number of replicated
copies kept by the cluster from '0' to e.g. '2' and restart. (It
might even be possible to vary this setting on a node-to-node basis,
so that this change does not even involve a complete cluster cold
start.) WADI should deal with the rest.
So, I believe that I have a pretty clear idea of what WADI will do,
and aside from the replication stuff (phase2) it currently does
most of what I had in mind for phase1, except that it is not yet
happy under stress. I figure it will probably take one or two more
redesign/reimplementation iterations to get it to this stage, then
I can consider replication.
I have spoken to members of the OpenEJB team about wadi's ability
to relocate requests as well as sessions and we came to the
conclusion that it was just as applicable in the EJB world as the
web world. If the node an ejb client is talking to leaves the
cluster in between calls, the client may try to contact it and then
fail over to another node that it hopes holds the session. If, due
to other nodes leaving/joining, it is not always clear which node
will contain the session, the ability to reply to an RMI and just
say "not here - there!" - i.e. an rmi redirection - would not be
hard to add and would resolve this situation. Transactions are
another item which I have marked phase2.
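Roughly, I imagine something like this (a sketch only, not an OpenEJB
API - all names made up):

import java.io.Serializable;

// Sketch of the "not here - there!" reply: instead of failing, a node that no
// longer holds the session answers with the address it believes now owns it,
// and the client simply retries there.
class RelocationException extends Exception {
    final String newNodeAddress;
    RelocationException(String newNodeAddress) { this.newNodeAddress = newNodeAddress; }
}

interface SessionInvoker {
    Serializable invoke(String node, String sessionId, Serializable call) throws Exception;
}

final class RedirectingClient {
    Serializable call(SessionInvoker invoker, String node, String sessionId, Serializable body)
            throws Exception {
        try {
            return invoker.invoke(node, sessionId, body);
        } catch (RelocationException moved) {
            // follow the redirection once; a real client would bound the retries
            return invoker.invoke(moved.newNodeAddress, sessionId, body);
        }
    }
}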
So, I am trying hard to stay very focussed on the problem domain,
otherwise this will never get finished :-)
Right, off to read those papers now - thanks for your posting and
your interest,
Jules
Hope this helps
andy
--
Joe Bohn
[EMAIL PROTECTED]
"He is no fool who gives what he cannot keep, to gain what he cannot lose." -- Jim Elliot