continuing from previous para...
there is one final issue which applies equally to this approach and any other failover situation (in which you don't assume that the dead node will be reincarnated - I think we should not rely on this)...
if A dies and all its requests fail over to B, B ends up servicing all its own requests and A's.
This is because B itself was already the primary location for a number of sessions.
three choices present themselves.
1. a hot-standby for each node - i.e. a buddy that is not a primary location for any sessions - this approach is hugely wasteful of resources and not really in the spirit of what we are trying to do, which is to have a homogeneous cluster, without specialist nodes...
2. B just has to live with it, although we could adjust its weighting within the lb to cut the number of requests coming to it down to just those that already have a session with B as its primary location. Eventually the number of sessions on B would die back down to a level similar to other nodes and you would put its lb weighting back up.
3. B offloads state onto other nodes, and adjusts the lb's mapping accordingly. This is why I've been considering multiple buckets per node. If the 1,2,3 in the routing info were bucket ids, then the lb contains bucket-id:node mappings. B could migrate buckets of sessions to other nodes and alter bucket:node mappings in the lb. The JSESSIONID value held in the browser does not have to change, but the lb has to map this value to new physical locations.
mod_jk can do this and I shall confirm that Big-IP and Cisco stuff can do the same.
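To make choice 3 concrete, here is a minimal sketch (class and method names are mine, purely illustrative - this is not mod_jk's or Big-IP's actual API) of the bucket-id:node table such an lb would hold, with the remap operation B would invoke when it offloads a bucket:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a bucket-id -> host:port routing table of the kind a
// smart lb would maintain. Session ids carry a bucket-id suffix; moving a
// bucket of sessions to another node is just a remap call - the JSESSIONID
// held in the browser never changes.
public class BucketRoutingTable {

    private final Map<Integer, String> bucketToNode = new ConcurrentHashMap<Integer, String>();

    // initial mapping, e.g. 1 -> "hostA:8080", 2 -> "hostB:8080", ...
    public void map(int bucketId, String hostPort) {
        bucketToNode.put(bucketId, hostPort);
    }

    // called when a node offloads a bucket of sessions onto a buddy
    public void remap(int bucketId, String newHostPort) {
        bucketToNode.put(bucketId, newHostPort);
    }

    // the lb's routing decision: extract the bucket-id suffix and look it up
    public String route(String sessionId) {
        int dot = sessionId.lastIndexOf('.');
        int bucketId = Integer.parseInt(sessionId.substring(dot + 1));
        return bucketToNode.get(bucketId);
    }
}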
In summary then, my locate/migrate-less approach (sorry to keep downing your locate/migrate thang, Jeremy :-) ) requires the following functionality from a smart lb...
1. API to dynamically add/remove/remap a 'route' to a host:port
2. API to understand a 'routing-list' session-id suffix.
3. API to dynamically adjust 'weighting' on a particular route.
This is pretty simple functionality, but with it, I think we can do pretty much everything that we need to.
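As a strawman, the three requirements above might boil down to an interface like this (the names are mine, not any real lb's API - mod_jk2/Big-IP/Cisco would each expose something different):

// Strawman only - not mod_jk's, Big-IP's or Cisco's actual API; it just
// captures the three capabilities listed above.
public interface SmartLoadBalancer {

    // 1. dynamically add/remove/remap a 'route' (routing-info token) to a host:port
    void mapRoute(String routeId, String hostPort);
    void removeRoute(String routeId);

    // 2. understand a 'routing-list' session-id suffix, e.g. "xxx.1,2,3":
    //    try each route in order and stick to the first live one
    String selectNode(String sessionIdWithRoutingList);

    // 3. dynamically adjust the 'weighting' on a particular route, so that a
    //    node which has just absorbed a dead buddy's sessions can be throttled
    //    back until its session count dies down again
    void setWeight(String routeId, int weight);
}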
Jules
Jules Gosnell wrote:
inline... just one paragraph...
Jules Gosnell wrote:
Jeremy Boynes wrote:
let me qualify that - it's OK if your cluster is small enough that it effectively only has one buddy-group, i.e. all nodes carry all the cluster's state. Then you use sync replication and it works - scaling up will require a smarter LB.
I'm going to pick up this thread again :-)
We just can't leave it alone :-)
we have to deal with both dumb and integrated load-balancers...
DUMB LB:
(A) undivided cluster, simple approach:
every node buddies for every other node
no 'locate/migrate' required since every session is on every node
replication needs to be synchronous, in order to guarantee that the node on
which the next request falls will be up-to-date
problem: unscalable
(B) subdivided cluster, more complex approach:
cluster subdivided into buddy teams (possibly only of pairs).
'locate/migrate' required since a request may fall on a node that does not
have the session to hand
primary could use async and secondary sync replication, provided that
'locate' always talked to the primary
sync and async are both options - sync may be needed for dumb clients (http
1.0 or ones which overlap requests e.g. for frames)
problem: given a cluster of n nodes divided into teams of t nodes, only
t/n of requests will be able to avoid the 'locate/migrate' step (e.g. with
20 nodes in teams of 2, only 1 request in 10 avoids it) - in a
large cluster with small teams, this is not much more efficient than a
shared store solution.
conclusion: DUMB LB is a bad choice in conjunction with a replication model
and shared state. Only recommended for use with stateless front ends.
I'm assuming that even the dumb one does this (i.e. it's an intrusive proxy rather than e.g. a non-intrusive round-robin-dns)
SMART LB (we're assuming it can do pretty much whatever we want it to).
We're assuming it is smart in that:
1) it can maintain session affinity (including SSL, if required)
2) it can detect failed nodes
3) it has (possibly configurable) policies for failing over if a node dies
hmm...
exactly, and if you have session affinity you might weigh up the odds of a failure compounded by a consecutive request racing and beating an async update, and decide to trade them for the runtime benefit of async vs sync replication.
I think that's all the capabilities it needs.
(A)
assuming affinity, we can use async replication, because the request will
always fall on the most up-to-date node.
async is a reliability/performance tradeoff - it introduces a window in
which modified state has not been replicated and may be lost. Again sync vs.
async should be configurable.
OK - this has been the root of some confusion. You have been assuming that affinity means 'stick-to-last-visited-node' whereas I am using the mod_jk reading, which is something like 'stick-to-a-particular-node-and-if-you-lose-it-you're-screwed'...
if this node fails, the lb MUST pick one to fail over to and continue to
use that one (or else we have to fall back to sync and assume a dumb lb)
if the original node comes back up, it doesn't matter whether the lb goes back
to it, or remains stuck to the fail-over node.
The premise is that the LB will pick a new node and stick to it. How it picks the
new node is undefined (depends on how the LB works) and may result in a
locate/migrate step if it picks a node without state.
If the old node comes back its state will be old and will trigger a locate/migrate
step if the LB picks it (e.g. if it has a preferential affinity model).
you only lose state, as above, in the event of a failure compounded by a replication update losing a race with a consecutive request - unless you mean that you lose state because it didn't get off-node before the node went down - in which case, agreed.
(B)
if we can arrange for the LB to use affinity, with failover limited to our
buddy-group, and to always stick to the failover node as well, we can lose
'locate/migrate' and replicate async. If we can't get 'always stick to the
failover node', we replicate sync after failover.
Again async only works if you are willing to lose state.
when using JavaGroups, if you multicast your RMI, you have the choice of a number of modes including GET_ALL and GET_NONE. GET_ALL means that you wait for a reply from all nodes RMI-ed to (sync), GET_NONE means that you don't wait for any (async). I may be wrong but I don't believe it takes any longer for your data to get off-node, you just don't keep the client hanging around whilst you wait for all your buddies to confirm their new state...
If you take this into account, JG-async is an attractive alternative, since it further compounds the unlikeliness of state loss: node-failure compounded with transport-failure compounded with a lost state-xfer/consecutive-request race...
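For reference, this is roughly what that choice looks like against the JavaGroups/JGroups 2.x RpcDispatcher API - written from memory, so check the signatures, package names and constants against the version you are actually using:

import java.util.Vector;
import org.jgroups.JChannel;
import org.jgroups.blocks.GroupRequest;
import org.jgroups.blocks.RpcDispatcher;

// Sketch only: replicate a session to the buddy group, either waiting for
// all replies (GET_ALL, sync) or none (GET_NONE, async). In both cases the
// update is sent immediately; async just doesn't block the request thread.
public class SessionReplicator {

    private final RpcDispatcher dispatcher;

    public SessionReplicator(JChannel channel) {
        this.dispatcher = new RpcDispatcher(channel, null, null, this);
    }

    // invoked remotely on each buddy via the dispatcher
    public void storeSession(String sessionId, byte[] state) {
        // stash the replicated session state locally...
    }

    public void replicate(String sessionId, byte[] state, boolean sync) {
        int mode = sync ? GroupRequest.GET_ALL : GroupRequest.GET_NONE;
        Vector dests = null; // null destination list = the whole group
        dispatcher.callRemoteMethods(dests, "storeSession",
                                     new Object[] { sessionId, state },
                                     new Class[] { String.class, byte[].class },
                                     mode, 5000 /* timeout ms */);
    }
}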
if we can only arrange affinity, but not failover within the group, we can
replicate async and need 'locate/migrate'. If we can't have
lb-remains-stuck-to-failover-node, we are in trouble, because as soon as the
primary node fails we go back to the situation outlined above where we
do a lot of locate/migrate and are not much better off than a
shared store.
Don't get you on this one - maybe we have a different definition of
affinity: mine is that a request will always be directed back to the node
that served the last one unless that node becomes unavailable. This means
that a request goes to the last node that served it, not the one that
originally created the session.
agreed - our usage differs, as discussed above.
but you do. If affinity is only to the creating node and you lose it, you are then back to a stage where only num-buddies/num-nodes of requests will hit the session-bearing node... until you either bring the node back up or (this is where the idea of session bucketing is useful) call up the lb and remap the session's routing info (usually e.g. the node name, but perhaps the bucket name) to another host:port combination.
Even if you have 'affinity to the node that created the session' then you
don't get a lot of locate/migrate - just a burst when the node comes back
online.
the burst will go on until the replacement node is up or the remapping is done and all sessions have been located/migrated back to their original node.
The lb-sticks-to-failover-node is not as simple as it sounds - mod_jk
doesn't do it.
:-(
it implies
either :
you have the ability to change the routing info carried on the session
id client side (I've considered this and don't think it practical - I
may be wrong ...)
I'm dubious about this too - it feels wrong but I can't see what it breaks.
I'm assuming that you set JSESSIONID to id.node with node always being the
last node that serves it. The LB tries to direct the request to node, but if
it is unavailable picks another from its configuration. If the new node does
not have state then you do locate/migrate.
not quite.
you set the session id ONCE, to e.g. (with mod_jk) <session-id>.<node-id>
resetting it is really problematic.
mod_jk maps (in its workers.properties) this node-id to a host:port combination.
if the host:port is unavailable it chooses another node at random - it neither tries to reset the cookie/url-param with the client, nor remembers the chosen node for subsequent requests for the same session...
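In other words (this is my reading of mod_jk's behaviour, not its actual code - the names below are made up), the routing decision amounts to something like:

import java.util.List;
import java.util.Map;
import java.util.Random;

// My reading of mod_jk's behaviour, not its actual code: route on the
// <session-id>.<node-id> suffix, and if that worker is down pick another at
// random, without updating the client's cookie or remembering the choice.
public class ModJkStyleRouter {

    private final Map<String, String> workerToHostPort; // e.g. "node1" -> "hostA:8009"
    private final List<String> liveWorkers;
    private final Random random = new Random();

    public ModJkStyleRouter(Map<String, String> workerToHostPort, List<String> liveWorkers) {
        this.workerToHostPort = workerToHostPort;
        this.liveWorkers = liveWorkers;
    }

    public String route(String jsessionid) {
        String worker = jsessionid.substring(jsessionid.lastIndexOf('.') + 1);
        if (liveWorkers.contains(worker)) {
            return workerToHostPort.get(worker);
        }
        // worker is down: pick any other live worker; the cookie is left
        // alone, so the next request repeats this dance
        String fallback = liveWorkers.get(random.nextInt(liveWorkers.size()));
        return workerToHostPort.get(fallback);
    }
}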
or :
the session id needs to carry not just a single piece of routing info
(like a mod_jk worker name) but a failover list - worker1,worker2,worker3
etc. - in effect your buddy-team,
Again, requires that the client handles changes to JSESSIONID OK. This
allows the nodes to determine the buddy group and would reduce the chance of
locate/migrate being needed.
AHA ! this is where I introduce session buckets.
tying the routing info to node ids is a bad idea - because nodes die. I think we should suffix the session ids with a bucket-id. Sessions cannot move out of a bucket, but a bucket can move to another node. If this happens, you call up the lb and remap the bucket-id:host-port entry. You can do this for mod_jk by rewriting its workers.properties and 'apachectl graceful' - nasty but solid. mod_jk2 actually has an http-based api that could probably do this. Big-IP has a SOAP API which will also probably let you do this, etc...
The extra level of indirection introduced (session->node becomes session->bucket->node) allows you this flexibility as well as the possibility of mapping more than one bucket to each node. Doing this will help in redistributing state around a cluster when buddy-groups are created/destroyed...
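A tiny sketch of that indirection (again, the names are mine): a session is hashed into a fixed bucket exactly once, so the suffix on the JSESSIONID never needs to change - only the bucket:node table (as sketched further up) gets remapped when buckets move:

// Illustrative only: sessions are pinned to buckets; buckets, not sessions,
// are what get (re)mapped to nodes.
public class BucketAssigner {

    private final int numBuckets; // fixed for the life of the cluster

    public BucketAssigner(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    // a session's bucket never changes, so its routing suffix never changes
    public int bucketFor(String sessionId) {
        return Math.abs(sessionId.hashCode() % numBuckets);
    }
}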
agreed - but now you have two pieces of infrastructure replicating instead of one - I don't see this as ideal. I would rather our solution worked for simpler lbs that don't support clever stuff like this...
or:
the lb needs to maintain state, remembering where each session was last
serviced and always sticking requests for that session to that node. in
a large deployment this requires lbs to replicate this state between
them so that they can balance over the same nodes in a coordinated
fashion. I think F5 Big-IP is capable of this, but effectively you just
shift the state problem from yourself to someone else.
Not quite - the LBs are sharing session-to-node affinity data which is very
small; the buddies are sharing session state which is much larger. You are
sharing the task, not shifting it. Yes, the LBs can do this.
hmmm... I can't store buddy-group and bucket-info in the session id - it has to be one or the other - I'll think on this some more and come back - maybe tonight...
Note that if your lb can understand extended routing info involving the
whole buddy team, then you know that it will always balance requests to
members of this team anyway, in which case you can dispense with
'locate/migrate' again.
It would be useful if the LB did this but I don't think it's a requirement.
I don't think you can dispense with locate unless you are willing to lose
sessions.
For example, if the buddy group is originally (nodeA, nodeB) and both those
nodes get cycled out, then the LB will not be able to find a node even if
the cluster migrates the data to (nodeC, nodeD). When it sees the request
come in and knows that A and B are unavailable, it will pick a random node,
say nodeX, and X needs to be able to locate/migrate from C or D.
let's say you have nodes A,B,C,D...
your session cookie contains xxx.1,2,3 where xxx is the session id and 1,2,3 is the 'routing-info' - a failover list (I think BEA works like this...)
e.g. mod_jk will have :
1:A 2:B 3:C
A,B,C are buddies
A dies :-(
your lb will route requests carrying the 1,2,3 routing list to B, in the absence of A (I guess B will just have to put up with double the load - is this a problem - probably yes :-( - probably not just with this solution, but any form of affinity)
your buddy group is now unbalanced
the algorithm I am proposing (let's call it 'clock' because buddies are arranged e.g. 1+2,2+3,3+4,...11+12,12+1 etc) would close the hole in the clock by shifting a couple of buddies from group to group... state would need to shift around too.
whilst requests are still landing on B (note that there has been no need for locate/migrate), we can call up the lb and, hopefully, remap to:
1:B 2:C 3:D
all that we have to do now, is to make sure that D is brought up to date with B and C by a bulk state transfer....
This approach avoided doing any locate/migrate...
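Here is a sketch of the routing-list handling assumed in the walkthrough above (the class is hypothetical - this is the behaviour I would want from mod_jk[2]/Big-IP, not something they are known to ship): walk the failover list and use the first live route.

import java.util.Map;
import java.util.Set;

// Hypothetical routing-list aware lb for the xxx.1,2,3 example above:
// try route 1, then 2, then 3, and only give up if the whole buddy team is gone.
public class RoutingListBalancer {

    private final Map<String, String> routeToHostPort; // e.g. "1" -> "hostA:8080"
    private final Set<String> liveRoutes;

    public RoutingListBalancer(Map<String, String> routeToHostPort, Set<String> liveRoutes) {
        this.routeToHostPort = routeToHostPort;
        this.liveRoutes = liveRoutes;
    }

    // session id looks like "xxx.1,2,3" - xxx is the id, 1,2,3 the buddy team
    public String route(String sessionId) {
        String routingList = sessionId.substring(sessionId.lastIndexOf('.') + 1);
        for (String route : routingList.split(",")) {
            if (liveRoutes.contains(route)) {
                return routeToHostPort.get(route); // A dead? fall through to B, then C
            }
        }
        return null; // whole buddy team gone - only now would locate/migrate be needed
    }
}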
I have definitely seen the routing list approach used somewhere and it probably would not be difficult to extend mod_jk[2] to do it. I will investigate whether Big-IP and Cisco's products might be able to do it...
If this functionality were commonly available in lbs, do you not think that this might be an effective way of cutting down further on the inter-node traffic in our cluster?
Jules
that is ONE way - but you can also, using mod_jk and a hierarchical lb arrangement (where lb is the worker type, not the Apache process), split up your cluster into buddy-groups. Affinity is then done to the buddy-group, rather than the node. Sync replication within the group gives you a scalable solution. Effectively, each buddy-group becomes its own mini-cluster.
This also saves the pre-emptive transfer of state from C back to A when A
rejoins - it only happens if nodeA gets selected.
Finally - you still need a migrate operation as sessions will need to
migrate from buddy-group to buddy-group as buddy-groups are created and
destroyed...
in summary - I think that you can optimise away 'locate' and a lot of 'migrate'-ion - Jetty's current impl has no locate and you can build subdivided clusters with it and mod_jk.... but I don't do automatic repartitioning yet....
IIRC Jetty's current impl does it by replicating to all nodes in the partition and I thought that's what you were trying to reduce :-)
this is correct - I am simply trying to demonstrate, that if we look hard enough, there should be solutions where even after node death we can avoid session locate/migrate.
The basic tradeoff is wide-replication vs. locate-after-death - they both
work, I just think locate-after-death results in less overhead during normal
operation at the cost of more after a membership change, which seems
preferable.
I would not feel easy performing maintenance on my cluster if I knew that each time I start/stop a node there was likely to be a huge flurry of activity as sessions are located/migrated everywhere - this is bound to impact on QoS, which is to be avoided...
Jules
If you are still reading here, then you are doing well :-)
Or you're just one sick puppy :-)
-- Jeremy
--
/**********************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 **********************************/
