[ https://issues.apache.org/jira/browse/CASSANDRA-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan King updated CASSANDRA-1418: --------------------------------- Fix Version/s: (was: 0.8) 1.0 > Automatic, online load balancing > -------------------------------- > > Key: CASSANDRA-1418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-1418 > Project: Cassandra > Issue Type: Improvement > Reporter: Stu Hood > Fix For: 1.0 > > > h2. Goal > CASSANDRA-192 began with the intention of implementing full cluster load > balancing, but ended up being (wisely) limited to a manual load balancing > operation. This issue is an umbrella ticket for finishing the job of > implementing automatic, always-on load balancing. > It is possible to implement very efficient load balancing operations with a > single process directing the rebalancing of all nodes, but avoiding such a > central process and allowing individual nodes to make their own movement > decisions would be ideal. > h2. Components > h3. Optimal movements for individual nodes > h4. Ruhl > One such approach is the Ruhl algorithm described on 192: > https://issues.apache.org/jira/browse/CASSANDRA-192#action_12713079 . But as > described, it performs excessive movement for large hotspots, and can take a > long time to reach equilibrium. Consider the following ring: > ||token||load|| > |a|5| > |c|5| > |e|5| > |f|40| > |k|5| > Assuming that node 'a' is the first to discover that 'f' is overloaded: it > will apply Case 2, and assume half of 'f's load by moving to 'i', leaving > both with 20 units. But this is not a optimal movement, because both 'f' and > 'a/i' will still be holding data that they will need to give away. > Additionally, 'a/i' can't begin giving the data away until it has finished > receiving it. > If node 'e' is the first to discover that 'f' is overloaded, it will apply > Case 1, and 'f' will give half of its load to 'e' by moving to 'i'. Again, > this is a non-optimal movement, because it will result in both 'e' and 'f/i' > holding data that they need to give away. > h4. Adding load awareness to Ruhl > Luckily, there appears to be a simple adjustment to the Ruhl algorithm that > solves this problem by taking advantage of the fact that Cassandra knows the > total load of a cluster, and can use it to calculate the average/ideal load > ω. Once node j has decided it should take load from node i (based on the ε > value in Ruhl), rather than node j taking 1/2 of the load on node i, it > should chose a token such that either i or j ends up with a load within ε*ω > of ω. > Again considering the ring described above, and assuming ε == 1.0, the total > load for the 5 nodes is 60, giving a ω of 12. If node 'a' is the first to > discover 'f', it will choose to move to 'j' (a token that takes 12 or ω load > units from 'f'), leaving 'f' with a load of 28. When combined with the > improvement in the next section, this is closer to being an optimal movement, > because 'a/j' will at worst have ε*ω of load to give away, and 'f' is in a > position to start more movements. > h3. Automatic load balancing > Since the Ruhl algorithm only requires a node to make a decision based on > itself and one other node, it should be relatively straightforward to add a > timer on each node that periodically wakes up and executes the modifiied Ruhl > algorithm if it is not already in the process of moving (based on pending > ranges). > Automatic balancing should probably be enabled by default, and should have a > configurable per-node bandwidth cap. > h3. Allowing concurrent moves on a node > Allowing a node to give away multiple ranges at once allows for the type of > quick balancing that is typically only attributed to vnodes. If a node is a > hotspot, such as in the example above, the node should be able to quickly > dump the load in a manner that causes minimal load on the rest of the > cluster. Rather than transferring to 1 target at 10 MB/s, a hotspot can give > to 5 targets at 2 MB/s each. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira