Automatic, online load balancing
--------------------------------
Key: CASSANDRA-1418
URL: https://issues.apache.org/jira/browse/CASSANDRA-1418
Project: Cassandra
Issue Type: Improvement
Reporter: Stu Hood
Fix For: 0.8
h2. Goal
CASSANDRA-192 began with the intention of implementing full cluster load
balancing, but ended up being (wisely) limited to a manual load balancing
operation. This issue is an umbrella ticket for finishing the job of
implementing automatic, always-on load balancing.
It is possible to implement very efficient load balancing operations with a
single process directing the rebalancing of all nodes, but avoiding such a
central process and allowing individual nodes to make their own movement
decisions would be ideal.
h2. Components
h3. Optimal movements for individual nodes
h4. Ruhl
One such approach is the Ruhl algorithm described on 192:
https://issues.apache.org/jira/browse/CASSANDRA-192#action_12713079 . But as
described, it performs excessive movement for large hotspots, and can take a
long time to reach equilibrium. Consider the following ring:
||token||load||
|a|5|
|c|5|
|e|5|
|f|40|
|k|5|
Assuming that node 'a' is the first to discover that 'f' is overloaded: it will
apply Case 2, and assume half of 'f's load by moving to 'i', leaving both with
20 units. But this is not a optimal movement, because both 'f' and 'a/i' will
still be holding data that they will need to give away. Additionally, 'a/i'
can't begin giving the data away until it has finished receiving it.
If node 'e' is the first to discover that 'f' is overloaded, it will apply Case
1, and 'f' will give half of its load to 'e' by moving to 'i'. Again, this is a
non-optimal movement, because it will result in both 'e' and 'f/i' holding data
that they need to give away.
h4. Adding load awareness to Ruhl
Luckily, there appears to be a simple adjustment to the Ruhl algorithm that
solves this problem by taking advantage of the fact that Cassandra knows the
total load of a cluster, and can use it to calculate the average/ideal load ω.
Once node j has decided it should take load from node i (based on the ε value
in Ruhl), rather than node j taking 1/2 of the load on node i, it should chose
a token such that either i or j ends up with a load within ε*ω of ω.
Again considering the ring described above, and assuming ε == 1.0, the total
load for the 5 nodes is 60, giving a ω of 12. If node 'a' is the first to
discover 'f', it will choose to move to 'j' (a token that takes 12 or ω load
units from 'f'), leaving 'f' with a load of 28. When combined with the
improvement in the next section, this is closer to being an optimal movement,
because 'a/j' will at worst have ε*ω of load to give away, and 'f' is in a
position to start more movements.
h3. Automatic load balancing
Since the Ruhl algorithm only requires a node to make a decision based on
itself and one other node, it should be relatively straightforward to add a
timer on each node that periodically wakes up and executes the modifiied Ruhl
algorithm if it is not already in the process of moving (based on pending
ranges).
Automatic balancing should probably be enabled by default, and should have a
configurable per-node bandwidth cap.
h3. Allowing concurrent moves on a node
Allowing a node to give away multiple ranges at once allows for the type of
quick balancing that is typically only attributed to vnodes. If a node is a
hotspot, such as in the example above, the node should be able to quickly dump
the load in a manner that causes minimal load on the rest of the cluster.
Rather than transferring to 1 target at 10 MB/s, a hotspot can give to 5
targets at 2 MB/s each.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.