Re: Clustering (long)

2005-08-03 Thread Jules Gosnell

hmm... hmmm... :-)

more thoughts on (1) and (2)...

When a node leaves/joins it needs to acquire a lease on the bucket
tables of every node that it intends to move buckets from/to. If two
nodes are doing this at the same time, their requirements will collide
(deadlock) somewhere in the cluster. At this point they may be notified
and e.g. compare IP addresses to decide who continues and who backs off
for a while.
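
A rough sketch of that tie-break in Java (the node ids here stand in for the IP addresses mentioned above; none of this is WADI code, just an illustration of the idea):

    // Sketch only: deterministic tie-break between two colliding nodes, using the
    // IP address (or any unique, comparable node id). Both sides evaluate the same
    // comparison, so exactly one of them backs off and retries later.
    public final class LeaseTieBreak {

        /** Returns true if the local node should release its leases and retry later. */
        static boolean shouldBackOff(String localId, String remoteId) {
            return localId.compareTo(remoteId) > 0;
        }

        public static void main(String[] args) {
            System.out.println(shouldBackOff("192.168.0.7", "192.168.0.3")); // true  -> back off
            System.out.println(shouldBackOff("192.168.0.3", "192.168.0.7")); // false -> continue
        }
    }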


So, (1) and (2), whilst possible, are probably more complex than I
initially imagined. If we have Paxos for the more general-purpose case
(3) anyway, it would probably be smart just to go with this, until such
optimisations become necessary, if at all.


Jules


Jules Gosnell wrote:


hmmm...

now I'm wondering about my solutions to (1) and (2) - if more than one 
node tries to join or leave at the same time I may be in trouble - so 
it may be safer to go straight to (3) for all cases...


more thought needed :-)

Jules



Jules Gosnell wrote:

I've had a look at the Lampson paper, but didn't take it all in on 
the first pass - I think it will need some serious concentration. The 
Paxos algorithm looks interesting, I will definitely pursue this avenue.


I've also given a little thought to exactly why I need a Coordinator 
and how Paxos might be used to replace it. My use of a Coordinator 
and plans for its future do not actually seem that far from Paxos, on 
a preliminary reading.


Given that WADI currently uses a distributed map of 
sessionId:sessionLocation, that this distribution is achieved by 
sharing out responsibility for the set number of buckets that 
comprise the map roughly evenly between the cluster members and that 
this is currently my most satisfying design, I can break my problem 
space (for bucket arrangement) down into 3 basic cases :


1) Node joins
2) Node leaves in controlled fashion
3) Node dies

If the node under discussion is the only cluster member, then no 
bucket rearrangement is necessary - this node will either create or 
destroy the full set of buckets. I'll leave this set of subcases as 
trivial.


1)  The joining node will need to assume responsibility for a number 
of buckets. If buckets-per-node is to be kept roughly the same for 
every node, it is likely that the joining node will require transfer 
of a small number of buckets from every current cluster member i.e. 
we are starting a bucket rearrangement that will involve every 
cluster member and only need be done if the join is successful. So, 
although we wish to avoid an SPoF, if that SPoF turns out to be the 
joining node, then I don't see it as a problem. If the joining node 
dies, then we no longer have to worry about rearranging our buckets 
(unless we have lost some that had already been transferred - see 
(3)). Thus the joining node may be used as a single 
Coordinator/Leader for this negotiation without fear of the SPoF 
problem. Are we on the same page here ?


2) The same argument may be applied in reverse to a node leaving in a 
controlled fashion. It will wish to evacuate its buckets roughly 
equally to all remaining cluster members. If it shuts down cleanly, 
this would form part of its shutdown protocol. If it dies before or 
during the execution of this protocol then we are back at (3), if 
not, then the SPoF issue may again be put to one side.


3) This is where things get tricky :-) Currently WADI has, for the 
sake of simplicity, one single algorithm / thread / point-of-failure 
which recalculates a complete bucket arrangement if it detects (1), 
(2) or (3). It would be simple enough to offload the work done for 
(1) and (2) to the node joining/leaving and this should reduce WADI's
current vulnerability, but we still need to deal with catastrophic
failure. Currently WADI rebuilds the missing buckets by querying the
cluster for the locations of any sessions that fall within them, but
it could equally carry a replicated backup and dust it off as part of
this procedure. It's just a trade-off between work done up front and
work done in exceptional circumstances... This is the place where the
Paxos algorithm may come in handy - bucket recomposition and
rearrangement. I need to give this further thought. For the immediate 
future, however, I think WADI will stay with a single Coordinator in 
this situation, which fails-over if http://activecluster.codehaus.org 
says it should - I'm delegating the really thorny problem to James 
:-). I agree with you that this is an SPoF and that WADI's ability to 
recover from failure here depends directly on how we decide if a node 
is alive or dead - a very tricky thing to do.


In conclusion then, I think that we have usefully identified a 
weakness that will become more relevant as the rest of WADI's 
features mature. The Lampson paper mentioned describes an algorithm 
for allowing nodes to reach a consensus on actions to be performed, 
in a redundant manner with no SPoF and I shall consider how this 
might replace WADI's currently 

Re: Clustering (long)

2005-08-03 Thread Joe Bohn
You can define an order to the semaphores when locking and thereby avoid
a deadlock.  If each node being added or terminating itself honors the
order then you will never have a deadlock.  However, you still need to
deal with the case of an uncontrolled failure while either adding or
removing a node, possibly never releasing a lock.


Joe

Jules Gosnell wrote:


hmm... hmmm... :-)

more thoughts on (1) and (2)...

When a node leaves/joins it needs to acquire a lease on the bucket
tables of every node that it intends to move buckets from/to. If two
nodes are doing this at the same time, their requirements will collide
(deadlock) somewhere in the cluster. At this point they may be
notified and e.g. compare IP addresses to decide who continues and who
backs off for a while.


So, (1) and (2), whilst possible, are probably more complex than
I initially imagined. If we have Paxos for the more general-purpose
case (3) anyway, it would probably be smart just to go with this,
until such optimisations become necessary, if at all.


Jules


Jules Gosnell wrote:


hmmm...

now I'm wondering about my solutions to (1) and (2) - if more than 
one node tries to join or leave at the same time I may be in trouble 
- so it may be safer to go straight to (3) for all cases...


more thought needed :-)

Jules



Jules Gosnell wrote:

I've had a look at the Lampson paper, but didn't take it all in on 
the first pass - I think it will need some serious concentration. 
The Paxos algorithm looks interesting, I will definitely pursue this 
avenue.


I've also given a little thought to exactly why I need a Coordinator 
and how Paxos might be used to replace it. My use of a Coordinator 
and plans for its future do not actually seem that far from Paxos, 
on a preliminary reading.


Given that WADI currently uses a distributed map of 
sessionId:sessionLocation, that this distribution is achieved by 
sharing out responsibility for the set number of buckets that 
comprise the map roughly evenly between the cluster members and that 
this is currently my most satisfying design, I can break my problem 
space (for bucket arrangement) down into 3 basic cases :


1) Node joins
2) Node leaves in controlled fashion
3) Node dies

If the node under discussion is the only cluster member, then no 
bucket rearrangement is necessary - this node will either create or 
destroy the full set of buckets. I'll leave this set of subcases as 
trivial.


1)  The joining node will need to assume responsibility for a number 
of buckets. If buckets-per-node is to be kept roughly the same for 
every node, it is likely that the joining node will require transfer 
of a small number of buckets from every current cluster member i.e. 
we are starting a bucket rearrangement that will involve every 
cluster member and only need be done if the join is successful. So, 
although we wish to avoid an SPoF, if that SPoF turns out to be the 
joining node, then I don't see it as a problem. If the joining node 
dies, then we no longer have to worry about rearranging our buckets 
(unless we have lost some that had already been transferred - see 
(3)). Thus the joining node may be used as a single 
Coordinator/Leader for this negotiation without fear of the SPoF 
problem. Are we on the same page here ?


2) The same argument may be applied in reverse to a node leaving in 
a controlled fashion. It will wish to evacuate its buckets roughly 
equally to all remaining cluster members. If it shuts down cleanly, 
this would form part of its shutdown protocol. If it dies before or 
during the execution of this protocol then we are back at (3), if 
not, then the SPoF issue may again be put to one side.


3) This is where things get tricky :-) Currently WADI has, for the 
sake of simplicity, one single algorithm / thread / point-of-failure 
which recalculates a complete bucket arrangement if it detects (1), 
(2) or (3). It would be simple enough to offload the work done for 
(1) and (2) to the node joining/leaving and this should reduce
WADI's current vulnerability, but we still need to deal with
catastrophic failure. Currently WADI rebuilds the missing buckets by
querying the cluster for the locations of any sessions that fall
within them, but it could equally carry a replicated backup and dust
it off as part of this procedure. It's just a trade-off between work
done up front and work done in exceptional circumstances... This is
the place where the Paxos algorithm may come in handy - bucket
recomposition and rearrangement. I need to give this further
thought. For the immediate future, however, I think WADI will stay 
with a single Coordinator in this situation, which fails-over if 
http://activecluster.codehaus.org says it should - I'm delegating 
the really thorny problem to James :-). I agree with you that this 
is an SPoF and that WADI's ability to recover from failure here 
depends directly on how we decide if a node is alive or dead - a 
very tricky thing to do.



Re: Clustering (long)

2005-08-03 Thread Jules Gosnell

Joe Bohn wrote:

You can define an order to the semaphores when locking and thereby 
avoid a deadlock.


Good idea - If I order the nodes according to some unique id and try for 
the lease on their bucket table in the same order, then multiple nodes 
trying at the same time should not deadlock... - it will take a little 
longer, since I will be acquiring locks sequentially, not concurrently, 
but ...
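
In rough Java terms, the ordering amounts to something like this (the Peer interface and lease call are invented for illustration, not WADI's API):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Sketch: acquire the bucket-table leases of all peers in a globally agreed
    // order (sorted by unique node id), so that two nodes joining/leaving at the
    // same time can never hold-and-wait on each other's leases.
    public final class OrderedLeaseAcquisition {

        interface Peer {
            String nodeId();              // unique, comparable id (an assumption)
            AutoCloseable acquireLease(); // blocks until this peer's bucket table is leased
        }

        static List<AutoCloseable> acquireAll(List<Peer> peers) {
            List<Peer> ordered = new ArrayList<>(peers);
            ordered.sort(Comparator.comparing(Peer::nodeId)); // same order on every node
            List<AutoCloseable> held = new ArrayList<>();
            for (Peer p : ordered) {
                held.add(p.acquireLease()); // sequential, as noted above - slower but deadlock-free
            }
            return held;
        }
    }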


I order locks in a single vm all the time, yet didn't make the mental
leap to doing it in different vms without your pointing it out - Thanks :-)


Jules

If each node being added or terminating itself honors the order then 
you will never have a deadlock.   However, you still need to deal with 
the case of an uncontrolled failure while either adding or removing a
node, possibly never releasing a lock.


Joe

Jules Gosnell wrote:


hmm... hmmm... :-)

more thoughts on (1) and (2)...

When a node leaves/joins it needs to acquire a lease on the bucket
tables of every node that it intends to move buckets from/to. If two
nodes are doing this at the same time, their requirements will collide
(deadlock) somewhere in the cluster. At this point they may be
notified and e.g. compare IP addresses to decide who continues and
who backs off for a while.


So, (1) and (2), whilst possible, are probably more complex than
I initially imagined. If we have Paxos for the more general-purpose
case (3) anyway, it would probably be smart just to go with this,
until such optimisations become necessary, if at all.


Jules


Jules Gosnell wrote:


hmmm...

now I'm wondering about my solutions to (1) and (2) - if more than 
one node tries to join or leave at the same time I may be in trouble 
- so it may be safer to go straight to (3) for all cases...


more thought needed :-)

Jules



Jules Gosnell wrote:

I've had a look at the Lampson paper, but didn't take it all in on 
the first pass - I think it will need some serious concentration. 
The Paxos algorithm looks interesting, I will definitely pursue 
this avenue.


I've also given a little thought to exactly why I need a 
Coordinator and how Paxos might be used to replace it. My use of a 
Coordinator and plans for its future do not actually seem that far 
from Paxos, on a preliminary reading.


Given that WADI currently uses a distributed map of 
sessionId:sessionLocation, that this distribution is achieved by 
sharing out responsibility for the set number of buckets that 
comprise the map roughly evenly between the cluster members and 
that this is currently my most satisfying design, I can break my 
problem space (for bucket arrangement) down into 3 basic cases :


1) Node joins
2) Node leaves in controlled fashion
3) Node dies

If the node under discussion is the only cluster member, then no 
bucket rearrangement is necessary - this node will either create or 
destroy the full set of buckets. I'll leave this set of subcases as 
trivial.


1)  The joining node will need to assume responsibility for a 
number of buckets. If buckets-per-node is to be kept roughly the 
same for every node, it is likely that the joining node will 
require transfer of a small number of buckets from every current 
cluster member i.e. we are starting a bucket rearrangement that 
will involve every cluster member and only need be done if the join 
is successful. So, although we wish to avoid an SPoF, if that SPoF 
turns out to be the joining node, then I don't see it as a problem, 
If the node joining dies, then we no longer have to worry about 
rearranging our buckets (unless we have lost some that had already 
been transferred - see (3)). Thus the joining node may be used as a 
single Coordinator/Leader for this negotiation without fear of the 
SPoF problem. Are we on the same page here ?


2) The same argument may be applied in reverse to a node leaving in 
a controlled fashion. It will wish to evacuate its buckets roughly 
equally to all remaining cluster members. If it shuts down cleanly, 
this would form part of its shutdown protocol. If it dies before or 
during the execution of this protocol then we are back at (3), if 
not, then the SPoF issue may again be put to one side.


3) This is where things get tricky :-) Currently WADI has, for the 
sake of simplicity, one single algorithm / thread / 
point-of-failure which recalculates a complete bucket arrangement 
if it detects (1), (2) or (3). It would be simple enough to offload 
the work done for (1) and (2) to the node joining/leaving and this 
should reduce WADI's current vulnerability, but we still need to
deal with catastrophic failure. Currently WADI rebuilds the missing
buckets by querying the cluster for the locations of any sessions
that fall within them, but it could equally carry a replicated
backup and dust it off as part of this procedure. It's just a
trade-off between work done up front and work done in exceptional
circumstances... This is the place where the Paxos algorithm may
come in handy - bucket recomposition 

Re: Clustering (long)

2005-08-02 Thread Andy Piper

Hi Jules

At 05:37 AM 7/27/2005, Jules Gosnell wrote:

I agree on the SPoF thing - but I think you misunderstand my 
Coordinator arch. I do not have a single static Coordinator node, 
but a dynamic Coordinator role, into which a node may be elected. 
Thus every node is a potential Coordinator. If the elected 
Coordinator dies, another is immediately elected. The election 
strategy is pluggable, although it will probably end up being 
hardwired to oldest-cluster-member. The reason behind this is that 
relaying out your cluster is much simpler if it is done in a single 
vm. I originally tried to do it in multiple vms, each taking 
responsibility for pieces of the cluster, but if the vms views are 
not completely in sync, things get very hairy, and completely in 
sync is an expensive thing to achieve - and would introduce a 
cluster-wide single point of contention. So I do it in a single vm, 
as fast as I can, with fail over, in case that vm evaporates. Does 
that sound better than the scenario that you had in mind ?


This is exactly the hard computer science problem that you
shouldn't be trying to solve if at all possible. It's hard because
network partitions or hung processes (think GC) make it very easy for
your colleagues to think you are dead when you do not share that
view. The result is two processes that both think they are the
coordinator, and anarchy can ensue (commonly called split-brain
syndrome). I can point you at papers if you want, but I really suggest
that you aim for an implementation that is independent of a central
coordinator. Note that a central coordinator is necessary if you want
to implement a strongly-consistent in-memory database, but this is not
usually a requirement for, say, session replication.


http://research.microsoft.com/Lampson/58-Consensus/Abstract.html 
gives a good introduction to some of these things. I also presented 
at JavaOne on related issues, you should be able to download the 
presentation from dev2dev.bea.com at some point (not there yet - I 
just checked).


The Coordinator is not there to support session replication, but 
rather the management of the distributed map (map of which a few 
buckets live on each node) which is used by WADI to discover very 
efficiently whether a session exists and where it is located. This 
map must be rearranged, in the most efficient way possible, each 
time a node joins or leaves the cluster.


Understood. Once you have a fault-tolerant singleton coordinator you
can solve lots of interesting problems; it's just hard and often not
worth the effort or the expense (typical implementations involve HA
HW or an HA DB or at least 3 server processes).


Replication is NYI - but I'm running a few mental background threads 
that suggest that an extension to the index will mean that it 
associates the session's id not just to its current location, but 
also to the location of a number of replicants. I also have ideas on 
how a session might choose nodes into which it will place its 
replicants and how I can avoid the primary session copy ever being 
colocated with a replicant (potential SPoF - if you only have one 
replicant), etc...


Right definitely something you want to avoid.

Yes, I can see that happening - I have an improvement (NYI) to 
WADI's evacuation strategy (how sessions are evacuated when a node 
wishes to leave). Each session will be evacuated to the node which 
owns the bucket into which its id hashes. This is because colocation 
of the session with the bucket allows many messages concerned with
its future destruction and relocation to be optimised away. Future
requests falling elsewhere but needing this session should, in the
most efficient case, be relocated to this same node; otherwise the
session may be relocated, but at a cost...


How do you relocate the request? Many HW load-balancers do not
support this (or else it requires using proprietary APIs), so you
probably have to count on moving sessions in the normal failover case.

I would be very grateful for any thoughts or feedback that you could 
give me. I hope to get much more information about WADI into the 
wiki over the next few weeks. That should help generate more 
discussion, although I would be more than happy for people to ask me 
questions here on Geronimo-dev because this will give me an idea of 
what documentation I should write and how existing documentation may 
be lacking or misleading.


I guess my general comment would be that you might find it better to 
think specifically about the end-user problem you are trying to solve 
(say session replication) and work towards a solution based on that. 
Most short-cuts / optimizations that vendors make are specific to the 
problem domain and do not generally apply to all clustering problems.


Hope this helps

andy 





Re: Clustering (long)

2005-08-02 Thread Jules Gosnell

Andy Piper wrote:


Hi Jules

At 05:37 AM 7/27/2005, Jules Gosnell wrote:

I agree on the SPoF thing - but I think you misunderstand my 
Coordinator arch. I do not have a single static Coordinator node, but 
a dynamic Coordinator role, into which a node may be elected. Thus 
every node is a potential Coordinator. If the elected Coordinator 
dies, another is immediately elected. The election strategy is 
pluggable, although it will probably end up being hardwired to 
oldest-cluster-member. The reason behind this is that relaying out 
your cluster is much simpler if it is done in a single vm. I 
originally tried to do it in multiple vms, each taking responsibility 
for pieces of the cluster, but if the vms views are not completely in 
sync, things get very hairy, and completely in sync is an expensive 
thing to achieve - and would introduce a cluster-wide single point of 
contention. So I do it in a single vm, as fast as I can, with fail 
over, in case that vm evaporates. Does that sound better than the 
scenario that you had in mind ?



This is exactly the hard computer science problem that you shouldn't
be trying to solve if at all possible. It's hard because network
partitions or hung processes (think GC) make it very easy for your
colleagues to think you are dead when you do not share that view. The
result is two processes that both think they are the coordinator, and
anarchy can ensue (commonly called split-brain syndrome). I can point
you at papers if you want, but I really suggest that you aim for an
implementation that is independent of a central coordinator. Note that
a central coordinator is necessary if you want to implement a
strongly-consistent in-memory database, but this is not usually a
requirement for, say, session replication.


http://research.microsoft.com/Lampson/58-Consensus/Abstract.html gives 
a good introduction to some of these things. I also presented at 
JavaOne on related issues, you should be able to download the 
presentation from dev2dev.bea.com at some point (not there yet - I 
just checked).


OK - I will have a look at these papers and reconsider... Perhaps I can
come up with some sort of fractal algorithm which recursively breaks
down the cluster into subclusters, each of which is capable of doing
likewise to itself, and then lay out the buckets recursively via this
metaphor... This would be much more robust, as you point out, but, I
think, a more complicated architecture. I will give it some serious
thought. Have you any suggestions/papers as to how you might do
something like this in a distributed manner, bearing in mind that, as a
node joins, some existing nodes will see it as having joined and some
will not yet have noticed, and vice-versa on leaving?




The Coordinator is not there to support session replication, but 
rather the management of the distributed map (map of which a few 
buckets live on each node) which is used by WADI to discover very 
efficiently whether a session exists and where it is located. This 
map must be rearranged, in the most efficient way possible, each time 
a node joins or leaves the cluster.



Understood. Once you have a fault-tolerant singleton coordinator you
can solve lots of interesting problems; it's just hard and often not
worth the effort or the expense (typical implementations involve HA HW
or an HA DB or at least 3 server processes).


Since I am only currently using the singleton coordinator for bucket 
arrangement, I may just live with it for the moment, in order to move 
forward, but make a note to replace it and start background threads on 
how that might be achieved...




Replication is NYI - but I'm running a few mental background threads 
that suggest that an extension to the index will mean that it 
associates the session's id not just to its current location, but 
also to the location of a number of replicants. I also have ideas on 
how a session might choose nodes into which it will place its 
replicants and how I can avoid the primary session copy ever being 
colocated with a replicant (potential SPoF - if you only have one 
replicant), etc...



Right definitely something you want to avoid.

Yes, I can see that happening - I have an improvement (NYI) to WADI's 
evacuation strategy (how sessions are evacuated when a node wishes to 
leave). Each session will be evacuated to the node which owns the 
bucket into which its id hashes. This is because colocation of the 
session with the bucket allows many messages concerned with its future
destruction and relocation to be optimised away. Future requests
falling elsewhere but needing this session should, in the most
efficient case, be relocated to this same node; otherwise the
session may be relocated, but at a cost...



How do you relocate the request? Many HW load-balancers do not support
this (or else it requires using proprietary APIs), so you probably
have to count on moving sessions in the normal failover case.


If I can squeeze the behaviour 

Re: Clustering (long)

2005-08-02 Thread Jules Gosnell
I've had a look at the Lampson paper, but didn't take it all in on the 
first pass - I think it will need some serious concentration. The Paxos 
algorithm looks interesting, I will definitely pursue this avenue.


I've also given a little thought to exactly why I need a Coordinator and 
how Paxos might be used to replace it. My use of a Coordinator and plans 
for its future do not actually seem that far from Paxos, on a 
preliminary reading.


Given that WADI currently uses a distributed map of 
sessionId:sessionLocation, that this distribution is achieved by sharing 
out responsibility for the set number of buckets that comprise the map 
roughly evenly between the cluster members and that this is currently my 
most satisfying design, I can break my problem space (for bucket 
arrangement) down into 3 basic cases :


1) Node joins
2) Node leaves in controlled fashion
3) Node dies

If the node under discussion is the only cluster member, then no bucket 
rearrangement is necessary - this node will either create or destroy the 
full set of buckets. I'll leave this set of subcases as trivial.


1)  The joining node will need to assume responsibility for a number of 
buckets. If buckets-per-node is to be kept roughly the same for every 
node, it is likely that the joining node will require transfer of a 
small number of buckets from every current cluster member i.e. we are 
starting a bucket rearrangement that will involve every cluster member 
and only need be done if the join is successful. So, although we wish to 
avoid an SPoF, if that SPoF turns out to be the joining node, then I
don't see it as a problem. If the joining node dies, then we no longer
have to worry about rearranging our buckets (unless we have lost some
that had already been transferred - see (3)). Thus the joining node may
be used as a single Coordinator/Leader for this negotiation without fear
of the SPoF problem. Are we on the same page here?
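
To make the arithmetic of (1) concrete, a toy Java sketch of how a joining node might work out how many buckets to request from each existing member (the even-share policy and all names are my own illustration, not WADI code):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch: given the current bucket counts per member, the joining node asks
    // each existing member for its excess over the new rough per-node share.
    public final class JoinRebalance {

        static Map<String, Integer> bucketsToRequest(Map<String, Integer> current, int numBuckets) {
            int newClusterSize = current.size() + 1;  // existing members plus the joiner
            int target = numBuckets / newClusterSize; // rough per-node share after the join
            Map<String, Integer> transfers = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> e : current.entrySet()) {
                transfers.put(e.getKey(), Math.max(0, e.getValue() - target));
            }
            return transfers;
        }

        public static void main(String[] args) {
            Map<String, Integer> current = new LinkedHashMap<>();
            current.put("nodeA", 36);
            current.put("nodeB", 36);
            // 72 buckets, 2 -> 3 nodes: ask each member for 36 - 24 = 12 buckets
            System.out.println(bucketsToRequest(current, 72)); // {nodeA=12, nodeB=12}
        }
    }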


2) The same argument may be applied in reverse to a node leaving in a 
controlled fashion. It will wish to evacuate its buckets roughly equally 
to all remaining cluster members. If it shuts down cleanly, this would 
form part of its shutdown protocol. If it dies before or during the 
execution of this protocol then we are back at (3), if not, then the 
SPoF issue may again be put to one side.


3) This is where things get tricky :-) Currently WADI has, for the sake 
of simplicity, one single algorithm / thread / point-of-failure which 
recalculates a complete bucket arrangement if it detects (1), (2) or 
(3). It would be simple enough to offload the work done for (1) and (2)
to the node joining/leaving and this should reduce WADI's current
vulnerability, but we still need to deal with catastrophic failure.
Currently WADI rebuilds the missing buckets by querying the cluster for
the locations of any sessions that fall within them, but it could
equally carry a replicated backup and dust it off as part of this
procedure. It's just a trade-off between work done up front and work
done in exceptional circumstances... This is the place where the Paxos
algorithm may come in handy - bucket recomposition and rearrangement. I
need to give this further thought. For the immediate future, however, I 
think WADI will stay with a single Coordinator in this situation, which 
fails-over if http://activecluster.codehaus.org says it should - I'm 
delegating the really thorny problem to James :-). I agree with you that 
this is an SPoF and that WADI's ability to recover from failure here 
depends directly on how we decide if a node is alive or dead - a very 
tricky thing to do.


In conclusion then, I think that we have usefully identified a weakness 
that will become more relevant as the rest of WADI's features mature. 
The Lampson paper mentioned describes an algorithm for allowing nodes to 
reach a consensus on actions to be performed, in a redundant manner with 
no SPoF, and I shall consider how this might replace WADI's current
single Coordinator, whilst also looking at performing other Coordination
on joining/leaving nodes where its failure, coinciding with that of its 
host node, will be irrelevant, since the very condition that it was 
intended to resolve has ceased to exist.


How does that sound, Andy? Do you agree with my thoughts on (1) and (2)?
This is great input - thanks,



Jules


Jules Gosnell wrote:


Andy Piper wrote:


Hi Jules

At 05:37 AM 7/27/2005, Jules Gosnell wrote:

I agree on the SPoF thing - but I think you misunderstand my 
Coordinator arch. I do not have a single static Coordinator node, 
but a dynamic Coordinator role, into which a node may be elected. 
Thus every node is a potential Coordinator. If the elected 
Coordinator dies, another is immediately elected. The election 
strategy is pluggable, although it will probably end up being 
hardwired to oldest-cluster-member. The reason behind this is that 
relaying out your cluster is much simpler if it is done in a single 

Re: Clustering (long)

2005-08-02 Thread Jules Gosnell

hmmm...

now I'm wondering about my solutions to (1) and (2) - if more than one 
node tries to join or leave at the same time I may be in trouble - so it 
may be safer to go straight to (3) for all cases...


more thought needed :-)

Jules



Jules Gosnell wrote:

I've had a look at the Lampson paper, but didn't take it all in on the 
first pass - I think it will need some serious concentration. The 
Paxos algorithm looks interesting, I will definitely pursue this avenue.


I've also given a little thought to exactly why I need a Coordinator 
and how Paxos might be used to replace it. My use of a Coordinator and 
plans for its future do not actually seem that far from Paxos, on a 
preliminary reading.


Given that WADI currently uses a distributed map of 
sessionId:sessionLocation, that this distribution is achieved by 
sharing out responsibility for the set number of buckets that comprise 
the map roughly evenly between the cluster members and that this is 
currently my most satisfying design, I can break my problem space (for 
bucket arrangement) down into 3 basic cases :


1) Node joins
2) Node leaves in controlled fashion
3) Node dies

If the node under discussion is the only cluster member, then no 
bucket rearrangement is necessary - this node will either create or 
destroy the full set of buckets. I'll leave this set of subcases as 
trivial.


1)  The joining node will need to assume responsibility for a number 
of buckets. If buckets-per-node is to be kept roughly the same for 
every node, it is likely that the joining node will require transfer 
of a small number of buckets from every current cluster member i.e. we 
are starting a bucket rearrangement that will involve every cluster 
member and only need be done if the join is successful. So, although 
we wish to avoid an SPoF, if that SPoF turns out to be the joining 
node, then I don't see it as a problem. If the joining node dies, then 
we no longer have to worry about rearranging our buckets (unless we 
have lost some that had already been transferred - see (3)). Thus the 
joining node may be used as a single Coordinator/Leader for this 
negotiation without fear of the SPoF problem. Are we on the same page 
here ?


2) The same argument may be applied in reverse to a node leaving in a 
controlled fashion. It will wish to evacuate its buckets roughly 
equally to all remaining cluster members. If it shuts down cleanly, 
this would form part of its shutdown protocol. If it dies before or 
during the execution of this protocol then we are back at (3), if not, 
then the SPoF issue may again be put to one side.


3) This is where things get tricky :-) Currently WADI has, for the 
sake of simplicity, one single algorithm / thread / point-of-failure 
which recalculates a complete bucket arrangement if it detects (1), 
(2) or (3). It would be simple enough to offload the work done for (1)
and (2) to the node joining/leaving and this should reduce WADI's
current vulnerability, but we still need to deal with catastrophic
failure. Currently WADI rebuilds the missing buckets by querying the
cluster for the locations of any sessions that fall within them, but
it could equally carry a replicated backup and dust it off as part of
this procedure. It's just a trade-off between work done up front and
work done in exceptional circumstances... This is the place where the
Paxos algorithm may come in handy - bucket recomposition and
rearrangement. I need to give this further thought. For the immediate
future, however, I think WADI will stay with a single Coordinator in 
this situation, which fails-over if http://activecluster.codehaus.org 
says it should - I'm delegating the really thorny problem to James 
:-). I agree with you that this is an SPoF and that WADI's ability to 
recover from failure here depends directly on how we decide if a node 
is alive or dead - a very tricky thing to do.


In conclusion then, I think that we have usefully identified a 
weakness that will become more relevant as the rest of WADI's features 
mature. The Lampson paper mentioned describes an algorithm for 
allowing nodes to reach a consensus on actions to be performed, in a 
redundant manner with no SPoF and I shall consider how this might 
replace WADI's current single Coordinator, whilst also looking at 
performing other Coordination on joining/leaving nodes where its 
failure, coinciding with that of its host node, will be irrelevant, 
since the very condition that it was intended to resolve has ceased to 
exist.


How does that sound, Andy? Do you agree with my thoughts on (1) and (2)?
This is great input - thanks,



Jules


Jules Gosnell wrote:


Andy Piper wrote:


Hi Jules

At 05:37 AM 7/27/2005, Jules Gosnell wrote:

I agree on the SPoF thing - but I think you misunderstand my 
Coordinator arch. I do not have a single static Coordinator node, 
but a dynamic Coordinator role, into which a node may be elected. 
Thus every node is a potential Coordinator. If the 

Re: Clustering (long)

2005-07-27 Thread Jules Gosnell

Andy Piper wrote:


Hi Jules

It sounds like you've been working hard!


yes - I need a break :-)



I think you might find you run into reliability issues with a
singleton coordinator. This is one of those well-known Hard Problems
and for session replication it's not really necessary. In essence the
coordinator is a single point of failure and making it reliable is
provably non-trivial.


I agree on the SPoF thing - but I think you misunderstand my Coordinator 
arch. I do not have a single static Coordinator node, but a dynamic 
Coordinator role, into which a node may be elected. Thus every node is a 
potential Coordinator. If the elected Coordinator dies, another is 
immediately elected. The election strategy is pluggable, although it 
will probably end up being hardwired to oldest-cluster-member. The 
reason behind this is that re-laying out your cluster is much simpler if
it is done in a single vm. I originally tried to do it in multiple vms,
each taking responsibility for pieces of the cluster, but if the vms'
views are not completely in sync, things get very hairy, and completely
in sync is an expensive thing to achieve - and would introduce a
cluster-wide single point of contention. So I do it in a single vm, as
fast as I can, with failover, in case that vm evaporates. Does that
sound better than the scenario that you had in mind?
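
In code terms the pluggable election idea boils down to something like this (the interface and member shape are invented for illustration; WADI's real strategy interface may differ):

    import java.util.Comparator;
    import java.util.List;

    // Sketch: a pluggable election strategy; the default picks the member that
    // joined the cluster earliest ("oldest-cluster-member").
    public final class Election {

        record Member(String nodeId, long joinedAtMillis) {}

        interface ElectionStrategy {
            Member elect(List<Member> liveMembers);
        }

        static final ElectionStrategy OLDEST_MEMBER =
            members -> members.stream()
                              .min(Comparator.comparingLong(Member::joinedAtMillis))
                              .orElseThrow();

        public static void main(String[] args) {
            List<Member> members = List.of(new Member("a", 200L), new Member("b", 100L));
            System.out.println(OLDEST_MEMBER.elect(members).nodeId()); // b
        }
    }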


The Coordinator is not there to support session replication, but rather 
the management of the distributed map (of which each node hosts a few
buckets), which is used by WADI to discover very efficiently whether
a session exists and where it is located. This map must be rearranged, 
in the most efficient way possible, each time a node joins or leaves the 
cluster.


Replication is NYI - but I'm running a few mental background threads 
that suggest that an extension to the index will mean that it associates 
the session's id not just to its current location, but also to the 
location of a number of replicants. I also have ideas on how a session 
might choose nodes into which it will place its replicants and how I can 
avoid the primary session copy ever being colocated with a replicant 
(potential SPoF - if you only have one replicant), etc...
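
As a sketch, the extended index entry might look roughly like this (the names are mine, not WADI's; only the constraint about never colocating the primary with a replicant comes from the text above):

    import java.util.List;

    // Sketch: a bucket entry recording a session's primary location plus the
    // locations of its replicants, with the primary never also a replicant host.
    public record SessionLocation(String sessionId, String primaryNode, List<String> replicantNodes) {

        public SessionLocation {
            if (replicantNodes.contains(primaryNode)) {
                throw new IllegalArgumentException("primary must not be colocated with a replicant");
            }
            replicantNodes = List.copyOf(replicantNodes); // defensive, immutable copy
        }
    }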




Cluster re-balancing is also a hairy issue in that it's easy to run 
into cascading failures if you pro-actively move sessions when a 
server leaves the cluster.


Yes, I can see that happening - I have an improvement (NYI) to WADI's 
evacuation strategy (how sessions are evacuated when a node wishes to 
leave). Each session will be evacuated to the node which owns the bucket 
into which its id hashes. This is because colocation of the session with 
the bucket allows many messages concerned with its future destruction and
relocation to be optimised away. Future requests falling elsewhere but
needing this session should, in the most efficient case, be relocated to
this same node; otherwise the session may be relocated, but at a cost...
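
In code terms, the improved evacuation target would be picked roughly like this (the bucket-owner table is assumed to exist; nothing here is WADI's actual API):

    import java.util.Map;

    // Sketch: when a node leaves cleanly, each of its sessions is evacuated to the
    // node that owns the bucket its id hashes into, so the bucket update and the
    // session transfer collapse into one destination.
    public final class Evacuation {

        static String evacuationTarget(String sessionId, int numBuckets, Map<Integer, String> bucketOwners) {
            int bucket = Math.floorMod(sessionId.hashCode(), numBuckets);
            return bucketOwners.get(bucket); // colocate the session with its bucket
        }
    }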




I'm happy to talk more about these issues off-line if you want.


I would be very grateful for any thoughts or feedback that you could give 
me. I hope to get much more information about WADI into the wiki over 
the next few weeks. That should help generate more discussion, although 
I would be more than happy for people to ask me questions here on 
Geronimo-dev because this will give me an idea of what documentation I 
should write and how existing documentation may be lacking or misleading.


Please fire away,

Jules



Thanks

andy

At 02:31 PM 6/30/2005, Jules Gosnell wrote:


Guys,

I thought that it was high time that I brought you up to date with my
efforts in building a clustering layer for Geronimo.

The project, wadi.codehaus.org, started as an effort to build a
scalable clustered HttpSession implementation, but in doing this, I
have built components that should be useful in clustering the state
held in any tier of Geronimo e.g. OpenEJB SFSBs etc.

WADI (Web Application Distribution Infrastructure) has two main
elements - the vertical and the horizontal.

Vertically, WADI comprises a stack of pluggable stores. Each store has
a pluggable Evicter responsible for demoting aging Sessions
downwards. Requests arriving at the container are fed into the top of
the stack and progress downwards, until their corresponding Session is
found and promoted to the top, where the request is correctly rendered
in its presence.

Typically the top-level store is in Memory. Aging Sessions are demoted
downwards onto exclusively owned LocalDisc. The bottom-most store is a
database shared between all nodes in the Cluster. The first node
joining the Cluster promotes all Sessions from the database into
exclusively-owned store - e.g. LocalDisc. The last node to leave the
Cluster demotes all Sessions down back into the database.
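
A stripped-down sketch of that vertical stack (the Store/Evicter shapes echo the description above but are my guesses, not WADI's interfaces):

    import java.util.List;
    import java.util.Optional;

    // Sketch: a request walks down a stack of stores (memory, local disc, shared
    // database); when the session is found it is promoted back to the top store.
    public final class StoreStack {

        interface Session {}

        interface Store {
            Optional<Session> remove(String id); // take the session out of this tier, if present
            void put(String id, Session s);      // put a session into this tier
        }

        static Optional<Session> lookup(List<Store> stack, String id) {
            for (Store store : stack) {
                Optional<Session> found = store.remove(id);
                if (found.isPresent()) {
                    stack.get(0).put(id, found.get()); // promote to the top (memory) store
                    return found;
                }
            }
            return Optional.empty();
        }
    }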

Horizontally, all nodes in a WADI Cluster are connected (p2p) via a
Clustered Store component within this stack. This typically sits at
the boundary between exclusive 

Re: Clustering (long)

2005-07-21 Thread Andy Piper

Hi Jules

It sounds like you've been working hard!

I think you might find you run into reliability issues with a singleton 
coordinator. This is one of those well-known Hard Problems and for session
replication it's not really necessary. In essence the coordinator is a
single point of failure and making it reliable is provably non-trivial.


Cluster re-balancing is also a hairy issue in that it's easy to run into 
cascading failures if you pro-actively move sessions when a server leaves 
the cluster.


I'm happy to talk more about these issues off-line if you want.

Thanks

andy

At 02:31 PM 6/30/2005, Jules Gosnell wrote:

Guys,

I thought that it was high time that I brought you up to date with my
efforts in building a clustering layer for Geronimo.

The project, wadi.codehaus.org, started as an effort to build a
scalable clustered HttpSession implementation, but in doing this, I
have built components that should be useful in clustering the state
held in any tier of Geronimo e.g. OpenEJB SFSBs etc.

WADI (Web Application Distribution Infrastructure) has two main
elements - the vertical and the horizontal.

Vertically, WADI comprises a stack of pluggable stores. Each store has
a pluggable Evicter responsible for demoting aging Sessions
downwards. Requests arriving at the container are fed into the top of
the stack and progress downwards, until their corresponding Session is
found and promoted to the top, where the request is correctly rendered
in its presence.

Typically the top-level store is in Memory. Aging Sessions are demoted
downwards onto exclusively owned LocalDisc. The bottom-most store is a
database shared between all nodes in the Cluster. The first node
joining the Cluster promotes all Sessions from the database into
exclusively-owned store - e.g. LocalDisc. The last node to leave the
Cluster demotes all Sessions down back into the database.

Horizontally, all nodes in a WADI Cluster are connected (p2p) via a
Clustered Store component within this stack. This typically sits at
the boundary between exclusive and shared Stores. As requests fall
through the stack, looking for their corresponding Session they arrive
at the Clustered store, where, if the Session is present anywhere in
the Cluster, its location may be learnt. At this point, the Session
may be migrated in, underneath the incoming request, or, if its
current location is considered advantageous, the request may be
proxied or redirected to its remote location. As a node leaves the
Cluster, all its Sessions are evacuated to other nodes via this store,
so that they may continue to be actively maintained.

The space in which Session ids are allocated is divided into a fixed
number of Buckets. This number should be large enough such that
management of the Buckets may be divided between all nodes in the
Cluster roughly evenly. As nodes leave and join the Cluster, a single
node, the Coordinator, is responsible for re-Bucketing the Cluster -
i.e. reorganising who manages which Buckets and ensuring the safe
transfer of the minimum number of Buckets to implement the new
layout. The Coordinator is elected via a Pluggable policy. If the
Coordinator leaves or fails, a new one is elected. If a node leaves or
joins, buckets emigrate from it or immigrate into it, under the
control of the Coordinator, to/from the rest of the Cluster.

A Session may be efficiently mapped to a Bucket by simply %-ing its
ID's hashcode() by the number of Buckets in the Cluster.
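
i.e. something along these lines (floorMod is used in this sketch because hashCode() may be negative, a detail the prose glosses over; this is an illustration, not WADI's code):

    // Sketch of the id -> bucket mapping described above.
    public final class BucketMapping {

        static int bucketFor(String sessionId, int numBuckets) {
            // floorMod rather than % so negative hash codes still map into [0, numBuckets)
            return Math.floorMod(sessionId.hashCode(), numBuckets);
        }

        public static void main(String[] args) {
            System.out.println(bucketFor("ABC123SESSION", 72)); // always in 0..71
        }
    }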

A Bucket is a map of SessionID:Location, kept up to date with the
Location of every Session in the Cluster whose id falls into
its space; i.e. as Sessions are created, destroyed or moved around the
Cluster, notifications are sent to the node managing the relevant
Bucket, informing it of the change.

In this way, if a node receives a request for a Session which it does
not own locally, it may pass a message to it, in a maximum of
typically two hops, by sending the message to the Bucket owner, who
then does a local lookup of the Session's actual location and forwards
the message to it. If Session and Bucket can be colocated, this can be
reduced to a single hop.
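
The two-hop routing, sketched very loosely (the Messenger and the two maps are invented stand-ins for the messaging layer and the bucket state):

    import java.util.Map;

    // Sketch: at most two hops to reach a session the local node does not hold.
    // Hop 1: send to the owner of the bucket the session id hashes into.
    // Hop 2: the bucket owner looks up the session's actual location and forwards.
    public final class SessionRouting {

        interface Messenger {
            void send(String targetNode, String sessionId, byte[] payload); // imagined transport
        }

        // Hop 1 - executed on the node that received the request.
        static void sendViaBucketOwner(String sessionId, byte[] payload,
                                       Map<Integer, String> bucketOwners, int numBuckets,
                                       Messenger messenger) {
            int bucket = Math.floorMod(sessionId.hashCode(), numBuckets);
            messenger.send(bucketOwners.get(bucket), sessionId, payload);
        }

        // Hop 2 - executed on the bucket owner, which knows the session's location.
        static void forwardToSession(String sessionId, byte[] payload,
                                     Map<String, String> bucketContents, // sessionId -> location
                                     Messenger messenger) {
            messenger.send(bucketContents.get(sessionId), sessionId, payload);
        }
    }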

Thus, WADI provides a fixed and scalable substrate over the more fluid
arrangement that Cluster membership comprises, on top of which further
Clustered services may be built.

The above functionality exists in WADI CVS and I am currently working
on hardening it to the point that I would consider it production
strength. I will then consider the addition of some form of state
replication, so that, even with the catastrophic failure of a member
node, no state is lost from the Cluster.

I plan to begin integrating WADI with Geronimo as soon as a certified
1.0-based release starts to settle down. Certification is the most
immediate goal and clustering is not really part of the spec, so I
think it best to satisfy the first before starting on subsequent
goals.

Further down the road we need to consider the unification of session
id spaces used 

Re: Clustering (long)

2005-07-04 Thread Jules Gosnell

Dain Sundstrom wrote:

This is cool stuff and an excellent intro to your code (you should  
put this summary on your website :)


I suggest you start chatting with OpenEJB and ActiveMQ about a shared  
session key sooner rather than later as it could take a while to get
it into their code bases.


I'm just hammering out the portlet spec's requirements with David, Jeff,
Greg and Jan, then I will get back to the other teams.


Jules



I can't wait to see this stuff integrated,

-dain

On Jun 30, 2005, at 2:31 PM, Jules Gosnell wrote:


Guys,

I thought that it was high time that I brought you up to date with my
efforts in building a clustering layer for Geronimo.

The project, wadi.codehaus.org, started as an effort to build a
scalable clustered HttpSession implementation, but in doing this, I
have built components that should be useful in clustering the state
held in any tier of Geronimo e.g. OpenEJB SFSBs etc.

WADI (Web Application Distribution Infrastructure) has two main
elements - the vertical and the horizontal.

Vertically, WADI comprises a stack of pluggable stores. Each store has
a pluggable Evicter responsible for demoting aging Sessions
downwards. Requests arriving at the container are fed into the top of
the stack and progress downwards, until their corresponding Session is
found and promoted to the top, where the request is correctly rendered
in its presence.

Typically the top-level store is in Memory. Aging Sessions are demoted
downwards onto exclusively owned LocalDisc. The bottom-most store is a
database shared between all nodes in the Cluster. The first node
joining the Cluster promotes all Sessions from the database into
exclusively-owned store - e.g. LocalDisc. The last node to leave the
Cluster demotes all Sessions down back into the database.

Horizontally, all nodes in a WADI Cluster are connected (p2p) via a
Clustered Store component within this stack. This typically sits at
the boundary between exclusive and shared Stores. As requests fall
through the stack, looking for their corresponding Session they arrive
at the Clustered store, where, if the Session is present anywhere in
the Cluster, its location may be learnt. At this point, the Session
may be migrated in, underneath the incoming request, or, if its
current location is considered advantageous, the request may be
proxied or redirected to its remote location. As a node leaves the
Cluster, all its Sessions are evacuated to other nodes via this store,
so that they may continue to be actively maintained.

The space in which Session ids are allocated is divided into a fixed
number of Buckets. This number should be large enough such that
management of the Buckets may be divided between all nodes in the
Cluster roughly evenly. As nodes leave and join the Cluster, a single
node, the Coordinator, is responsible for re-Bucketing the Cluster -
i.e. reorganising who manages which Buckets and ensuring the safe
transfer of the minimum number of Buckets to implement the new
layout. The Coordinator is elected via a Pluggable policy. If the
Coordinator leaves or fails, a new one is elected. If a node leaves or
joins, buckets emigrate from it or immigrate into it, under the
control of the Coordinator, to/from the rest of the Cluster.

A Session may be efficiently mapped to a Bucket by simply %-ing its
ID's hashcode() by the number of Buckets in the Cluster.

A Bucket is a map of SessionID:Location, kept up to date with the
Location of every Session in the Cluster whose id falls into
its space; i.e. as Sessions are created, destroyed or moved around the
Cluster, notifications are sent to the node managing the relevant
Bucket, informing it of the change.

In this way, if a node receives a request for a Session which it does
not own locally, it may pass a message to it, in a maximum of
typically two hops, by sending the message to the Bucket owner, who
then does a local lookup of the Session's actual location and forwards
the message to it. If Session and Bucket can be colocated, this can be
reduced to a single hop.

Thus, WADI provides a fixed and scalable substrate over the more fluid
arrangement that Cluster membership comprises, on top of which further
Clustered services may be built.

The above functionality exists in WADI CVS and I am currently working
on hardening it to the point that I would consider it production
strength. I will then consider the addition of some form of state
replication, so that, even with the catastrophic failure of a member
node, no state is lost from the Cluster.

I plan to begin integrating WADI with Geronimo as soon as a certified
1.0-based release starts to settle down. Certification is the most
immediate goal and clustering is not really part of the spec, so I
think it best to satisfy the first before starting on subsequent
goals.

Further down the road we need to consider the unification of session
id spaces used by e.g. the web and ejb tiers and introduction of an
ApplicationSession abstraction - an 

Re: Clustering (long)

2005-07-03 Thread Dain Sundstrom
This is cool stuff and an excellent intro to your code (you should  
put this summary on your website :)


I suggest you start chatting with OpenEJB and ActiveMQ about a shared  
session key sooner rather than later as it could take a while to get
it into their code bases.


I can't wait to see this stuff integrated,

-dain

On Jun 30, 2005, at 2:31 PM, Jules Gosnell wrote:


Guys,

I thought that it was high time that I brought you up to date with my
efforts in building a clustering layer for Geronimo.

The project, wadi.codehaus.org, started as an effort to build a
scalable clustered HttpSession implementation, but in doing this, I
have built components that should be useful in clustering the state
held in any tier of Geronimo e.g. OpenEJB SFSBs etc.

WADI (Web Application Distribution Infrastructure) has two main
elements - the vertical and the horizontal.

Vertically, WADI comprises a stack of pluggable stores. Each store has
a pluggable Evicter responsible for demoting aging Sessions
downwards. Requests arriving at the container are fed into the top of
the stack and progress downwards, until their corresponding Session is
found and promoted to the top, where the request is correctly rendered
in its presence.

Typically the top-level store is in Memory. Aging Sessions are demoted
downwards onto exclusively owned LocalDisc. The bottom-most store is a
database shared between all nodes in the Cluster. The first node
joining the Cluster promotes all Sessions from the database into
exclusively-owned store - e.g. LocalDisc. The last node to leave the
Cluster demotes all Sessions down back into the database.

Horizontally, all nodes in a WADI Cluster are connected (p2p) via a
Clustered Store component within this stack. This typically sits at
the boundary between exclusive and shared Stores. As requests fall
through the stack, looking for their corresponding Session they arrive
at the Clustered store, where, if the Session is present anywhere in
the Cluster, its location may be learnt. At this point, the Session
may be migrated in, underneath the incoming request, or, if its
current location is considered advantageous, the request may be
proxied or redirected to its remote location. As a node leaves the
Cluster, all its Sessions are evacuated to other nodes via this store,
so that they may continue to be actively maintained.

The space in which Session ids are allocated is divided into a fixed
number of Buckets. This number should be large enough such that
management of the Buckets may be divided between all nodes in the
Cluster roughly evenly. As nodes leave and join the Cluster, a single
node, the Coordinator, is responsible for re-Bucketing the Cluster -
i.e. reorganising who manages which Buckets and ensuring the safe
transfer of the minimum number of Buckets to implement the new
layout. The Coordinator is elected via a Pluggable policy. If the
Coordinator leaves or fails, a new one is elected. If a node leaves or
joins, buckets emigrate from it or immigrate into it, under the
control of the Coordinator, to/from the rest of the Cluster.

A Session may be efficiently mapped to a Bucket by simply %-ing its
ID's hashcode() by the number of Buckets in the Cluster.

A Bucket is a map of SessionID:Location, kept up to date with the
Location of every Session in the Cluster whose id falls into
its space; i.e. as Sessions are created, destroyed or moved around the
Cluster, notifications are sent to the node managing the relevant
Bucket, informing it of the change.

In this way, if a node receives a request for a Session which it does
not own locally, it may pass a message to it, in a maximum of
typically two hops, by sending the message to the Bucket owner, who
then does a local lookup of the Session's actual location and forwards
the message to it. If Session and Bucket can be colocated, this can be
reduced to a single hop.

Thus, WADI provides a fixed and scalable substrate over the more fluid
arrangement that Cluster membership comprises, on top of which further
Clustered services may be built.

The above functionality exists in WADI CVS and I am currently working
on hardening it to the point that I would consider it production
strength. I will then consider the addition of some form of state
replication, so that, even with the catastrophic failure of a member
node, no state is lost from the Cluster.

I plan to begin integrating WADI with Geronimo as soon as a certified
1.0-based release starts to settle down. Certification is the most
immediate goal and clustering is not really part of the spec, so I
think it best to satisfy the first before starting on subsequent
goals.

Further down the road we need to consider the unification of session
id spaces used by e.g. the web and ejb tiers and introduction of an
ApplicationSession abstraction - an object which encapsulates all
e.g. web and ejb sessions associated with a particular client talking
to a particular application. This will allow WADI to maintain the

Re: Clustering (long)

2005-06-30 Thread Jeff Genender

Jules,

This is awesome stuff.  I look forward to playing with this stuff in 
Geronimo.  Let me know if you need a hand with something...


Jeff


Re: Clustering (long)

2005-06-30 Thread Jules Gosnell

Jeff Genender wrote:


Jules,

This is awesome stuff.  I look forward to playing with this stuff in 
Geronimo.  Let me know if you need a hand with something...



Thanks for the encouragement, Jeff. It was good to meet you at J1. I'll 
give you a shout when things get a little more organised.



Jules


