I know this is a little early, but you don't have to read it :-)
I've been playing around with some ideas about [web] clustering and distribution of session state that I would like to share...
My last iteration on this theme (Jetty's distributable session manager) left me with an implementation in which every node in the cluster held a backup copy of every other node's state.
This is fine for small clusters, but the larger the cluster and the more total state it holds, the less well it scales, because every node needs more memory and more bandwidth to hold all these backups and keep them up to date.
The standard way around this is to 'partition' your cluster, i.e. break it up into lots of sub-clusters, each small enough not to exhibit this problem. This complicates load-balancing, as the load-balancer needs to know which nodes belong to which partitions, i.e. which nodes carry which state. Furthermore, in my implementation the partitions had to be configured statically, which is a burden on the administrator in terms of time and understanding, and means that the layout will probably end up suboptimal or, even worse, broken. You also need more redundancy, because a static configuration cannot react to runtime events (say, by repartitioning when a node goes down), so you would be tempted to put more nodes in each partition, etc...
So, I am now thinking about dynamically 'auto-partitioning' clusters with coordinated load-balancing.
The idea is to build a reliable underlying substrate of partitions which reorganise dynamically as nodes leave and join the cluster.
I have something up and running which does this OK - basically, nodes are arranged in a circle, and each node, together with the n-1 nodes before it (where n is however many nodes you want in a partition), forms a partition. Partitions therefore overlap around the clock; in fact, every node will belong to exactly as many partitions as there are members in each one.
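To make the geometry concrete, here is a minimal sketch (illustrative Java with made-up node names, not the real code) of how partition membership falls out of the ring:

import java.util.ArrayList;
import java.util.List;

public class RingPartitions {

    // Partition i consists of node i plus the n-1 nodes immediately
    // before it on the ring, so every node belongs to exactly n partitions.
    public static List<List<String>> partitions(List<String> ring, int n) {
        List<List<String>> result = new ArrayList<>();
        int size = ring.size();
        for (int i = 0; i < size; i++) {
            List<String> partition = new ArrayList<>();
            for (int j = 0; j < n; j++) {
                // walk backwards around the ring, wrapping past zero
                partition.add(ring.get(((i - j) % size + size) % size));
            }
            result.add(partition);
        }
        return result;
    }

    public static void main(String[] args) {
        // five nodes, three per partition:
        // [A, E, D], [B, A, E], [C, B, A], [D, C, B], [E, D, C]
        partitions(List.of("A", "B", "C", "D", "E"), 3)
                .forEach(System.out::println);
    }
}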
As a node joins the cluster, the partitions are rearranged to accommodate it whilst maintaining this strategy. It transpires that the n-1 nodes directly behind the new arrival (where n is the number of nodes per partition) each have to pass ownership of an existing partition membership to the new node, and a new partition has to be created encompassing these nodes and the newcomer. So, for each new node, a total of n-1 partition releases and n partition acquisitions occurs.
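A concrete example (my own, with made-up node names): with n=3 and a ring A-B-C-D, suppose X joins between D and A. The two nodes directly behind X are C and D: C hands its membership of A's partition to X, D hands its membership of B's partition to X (the n-1 = 2 releases), and a new partition {X, D, C} is created (the n = 3 acquisitions). Every other partition is untouched, so the disruption stays local to the insertion point.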
As a node leaves the cluster, exactly the same happens in reverse. This applies until the total number of nodes in the cluster falls to n or below, at which point you can optimise the number of partitions down to 1.
I'm happy (currently) with this arrangement as far as a robust underlying platform for state replication goes. The question is whether the 'buckets' that state is held in should map directly to these partitions (i.e. each partition maintains a single bucket), or whether their granularity should be finer.
I'm still thinking about that at the moment, but thought I would throw my thoughts so far into the ring and see how they go..
If you map 1 bucket:1 partition you will end up with a cluster that is very unevenly balanced in terms of state. Every time a node joins, although it will acquire n-1 'buddies' and copies of their state, you won't be able to balance up the existing state across the cluster, because to give the newcomer some state of its own you would need to take it from another partition - and since the mapping is 1:1, the smallest unit of state that you can transfer from one partition to another is its entire contents/bucket.
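To illustrate (my own example): if a 4-node cluster holding 1000 sessions gains a fifth node, the newcomer immediately takes on backup copies for the partitions it joins, but owns no primary state at all, and the only way to give it any would be to hand over another partition's entire bucket, wholesale.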
The other extreme is 1 bucket:1 session (assuming we are talking about session state here) - but then why have buckets at all?
My next thought was that the number of buckets should be the square of the number of nodes in the cluster. That way, every node carries n buckets (where n is the total number of nodes in the cluster). When you add a new node you can transfer a bucket from each existing node to it; each existing node creates two empty buckets and the new node creates one. State is nicely balanced, etc... However, this means that the number of buckets grows quadratically with the size of the cluster, which smacks of something unscalable to me and was exactly the reason that I added partitions. In the web world, I shall probably have to map each bucket to e.g. a mod_jk worker. (This entails appending the name of the bucket to the session-id and telling Apache/mod_jk to associate a particular host:port with this name. I can alter the node on which the bucket is held by informing Apache/mod_jk that this mapping has changed. Moving a session from one bucket to another is problematic and probably impossible, as I would have to change the session-id that the client holds. Whilst I could do this if I waited for a request from a single-threaded client, with a multi-threaded client it becomes much more difficult...) If we had 25 nodes in the cluster, this would mean 625 Apache/mod_jk workers... I'm not sure that mod_jk would like this :-).
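To check the bucket arithmetic with concrete numbers (my own illustration): with 4 nodes there are 16 buckets, 4 per node. When a fifth node joins, the total must grow to 25, i.e. 9 new empty buckets - each of the 4 existing nodes creates 2 (making 8) and the newcomer creates 1 - and each existing node then hands one full bucket to the newcomer, leaving every node with exactly 5 buckets.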
So, I am currently thinking about either a fixed number of buckets per node, so that the total number of buckets increases only linearly with the size of the cluster, or a 1:1 mapping, living with the inability to immediately rebalance state and instead dynamically altering the node weightings in the load-balancer (mod_jk) so that new sessions favour the lighter nodes and the cluster soon evens out again...
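For illustration, here is a hypothetical sketch of how those weightings might be derived (the heuristic and the names are made up, purely to show the idea): give the emptiest node the largest weight so that new sessions flow towards it until things even out.

import java.util.HashMap;
import java.util.Map;

public class WeightBalancer {

    // Hypothetical: derive a per-node load-balancer weight that is
    // inversely related to the session state the node already holds,
    // so an empty newcomer attracts proportionally more new sessions.
    public static Map<String, Integer> weights(Map<String, Integer> sessions) {
        int max = sessions.values().stream().max(Integer::compare).orElse(0);
        Map<String, Integer> weights = new HashMap<>();
        sessions.forEach((node, count) -> weights.put(node, max - count + 1));
        return weights;
    }

    public static void main(String[] args) {
        // existing nodes hold ~100 sessions each; "node4" just joined empty
        Map<String, Integer> sessions = Map.of(
                "node1", 100, "node2", 110, "node3", 95, "node4", 0);
        System.out.println(weights(sessions)); // node4 gets the largest weight
    }
}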
I guess that's enough for now....
If this is an area that interests you, please give me your thoughts and we can massage them all together :-)
Jules
--
/*************************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 * http://www.coredevelopers.net
 *************************************/
