Andy Piper wrote:
Hi Jules
At 05:37 AM 7/27/2005, Jules Gosnell wrote:
I agree on the SPoF thing - but I think you misunderstand my
Coordinator arch. I do not have a single static Coordinator node, but
a dynamic Coordinator role, into which a node may be elected. Thus
every node is a potential Coordinator. If the elected Coordinator
dies, another is immediately elected. The election strategy is
pluggable, although it will probably end up being hardwired to
"oldest-cluster-member". The reason behind this is that relaying out
your cluster is much simpler if it is done in a single vm. I
originally tried to do it in multiple vms, each taking responsibility
for pieces of the cluster, but if the vms views are not completely in
sync, things get very hairy, and completely in sync is an expensive
thing to achieve - and would introduce a cluster-wide single point of
contention. So I do it in a single vm, as fast as I can, with fail
over, in case that vm evaporates. Does that sound better than the
scenario that you had in mind ?
This is exactly the "hard" computer science problem that you shouldn't
be trying to solve if at all possible. Its hard because network
partitions or hung processes (think GC) make it very easy for your
colleagues to think you are dead when you do not share that view. The
result is two processes who think they are the coordinator and anarchy
can ensue (commonly called split-brain syndrome). I can point you at
papers if you want, but I really suggest that you aim for an
implementation that is independent of a central coordinator. Note that
a central coordinator is necessary if you want to implement a
strongly-consistent in-memory database, but this is not usually a
requirement for session replication say.
http://research.microsoft.com/Lampson/58-Consensus/Abstract.html gives
a good introduction to some of these things. I also presented at
JavaOne on related issues, you should be able to download the
presentation from dev2dev.bea.com at some point (not there yet - I
just checked).
OK - I will have a look at these papers and reconsider... perhaps I can
come up with some sort of fractal algorithm which recursively breaks
down the cluster into subclusters each of which is capable of doing
likewise to itself and then layout the buckets recursively via this
metaphor... - this would be much more robust, as you point out, but, I
think, a more complicated architecture. I will give it some serious
thought. Have you any suggestions/papers as to how you might do
something like this in a distributed manner, bearing in mind that as a
node joins, some existing nodes will see it as having joined and some
will not yet have noticed and vice-versa on leaving....
The Coordinator is not there to support session replication, but
rather the management of the distributed map (map of which a few
buckets live on each node) which is used by WADI to discover very
efficiently whether a session exists and where it is located. This
map must be rearranged, in the most efficient way possible, each time
a node joins or leaves the cluster.
Understood. Once you have a fault-tolerant singleton coordinator you
can solve lots of interesting problems, its just hard and often not
worth the effort or the expense (typical implementations involve HA HW
or an HA DB or at least 3 server processes).
Since I am only currently using the singleton coordinator for bucket
arrangement, I may just live with it for the moment, in order to move
forward, but make a note to replace it and start background threads on
how that might be achieved...
Replication is NYI - but I'm running a few mental background threads
that suggest that an extension to the index will mean that it
associates the session's id not just to its current location, but
also to the location of a number of replicants. I also have ideas on
how a session might choose nodes into which it will place its
replicants and how I can avoid the primary session copy ever being
colocated with a replicant (potential SPoF - if you only have one
replicant), etc...
Right definitely something you want to avoid.
Yes, I can see that happening - I have an improvement (NYI) to WADI's
evacuation strategy (how sessions are evacuated when a node wishes to
leave). Each session will be evacuated to the node which owns the
bucket into which its id hashes. This is because colocation of the
session with the bucket allows many messages concered with its future
destruction and relocation to be optimised away. Future requests
falling elsewhere but needing this session should, in the most
efficient case, be relocated to this same node, other wise the
session may be relocated, but at a cost...
How do you relocate the request? Many HW load-balancers do not support
this (or else it requires using proprietary APIs), so you probably
have to count on
moving sessions in the normal failover case.
If I can squeeze the behaviour that I require out of the load-balancer,
then, depending on the request type I may be able to get away with a
redirection with a changed session cookie or url param, or, failing this
an http-proxy, across from a filter above the servlet on one side to the
http-port on the node that owns the session...
The LB-integration object is pluggable and the aim is to supply wadi
with a good selection of LB integrations - currently I only have a
ModJK[2] plugin working. This is able to 'restick' clients to their
session's new location (although messing with the session id is a little
dodgy...).
I would be very grateful in any thoughts or feedback that you could
give me. I hope to get much more information about WADI into the wiki
over the next few weeks. That should help generate more discussion,
although I would be more than happy for people to ask me questions
here on Geronimo-dev because this will give me an idea of what
documentation I should write and how existing documentation may be
lacking or misleading.
I guess my general comment would be that you might find it better to
think specifically about the end-user problem you are trying to solve
(say session replication) and work towards a solution based on that.
Most short-cuts / optimizations that vendors make are specific to the
problem domain and do not generally apply to all clustering problems.
The end problem is really clustered web and ejb sessions at the moment,
although it looks as if by the time we have solved these issues we may
well have written a fault-tolerant distributed/partitioned index that
might be very useful as a generic distributed cache building block.
One thing that I do want wadi to do, is to still work when replication
is switched off. i.e., if a session only exists as a primary copy, even
if affinity breaks down, wadi will continue to correctly render requests
for that session unless some form of catastrophic failure causes the
session to evaporate. This means that I need to ensure the session's
timely evacuation from a node that chooses to leave the cluster to a
remaining node, so that it may remain active beyond the lifetime of its
original node. All of this must work flawlessly under stress, so that an
admin may add or remove nodes to a running cluster without having to
worry about the user state that it is managing. Nodes are added by
simply starting them, and nodes removed via e.g. ctl-c-ing them.
If it is decided that a few more nines are needed in terms of session
availability and the cluster owner understands the extra cost involved
in in-vm replication in terms of extra hardware and bandwidth that they
will have to purchase and is happy to go with in-vm-replication, then it
should be sufficient to up the number of replicated copies kept by the
cluster from '0' to e.g. '2' and restart (It might even be possible to
vary this setting on a node to node basis so that this change does not
even involve a complete cluster cold start). WADI should deal with the rest.
So, I believe that I have a pretty clear idea of what WADI will do, and
aside from the replication stuff (phase2) it currently does most of what
iIhad in mind for phase1, except that it is not yet happy under stress.
I figure it will probably take one or two more redesign/reimplementation
iterations to get it to this stage, then I can consider replication.
I have spoken to members of the OpenEJB team about wadi's ability to
relocate requests as well as sessions and we came to the conclusion that
it was just as applicable in the EJB world as the web world. If the node
an ejb client is talking to leaves the cluster in between calls, the
client may try to contact it and then failover to another node that it
hopes holds the session. If, due to other nodes leaving/joining it is
not always clear which node will contain the session, the ability to
reply to an RMI and just say "not here - there!" - i.e. an rmi
redirection - would not be hard to add and would resolve this situation.
Transactions are another item which I have marked phase2.
So, I am trying hard to stay very focussed on the problem domain,
otherwise this will never get finished :-)
Right, off to read those papers now - thanks for your posting and your
interest,
Jules
Hope this helps
andy
--
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."
/**********************************
* Jules Gosnell
* Partner
* Core Developers Network (Europe)
*
* www.coredevelopers.net
*
* Open Source Training & Support.
**********************************/