Hi,

The current implementation of loadbalance seems to work only for RackUnaware. The problem is that only RackUnaware creates a "flat" replica space where all nodes are equal. This is not true for the other strategies: due to rack and DC considerations, replicas are not evenly distributed among nodes. To illustrate the problem, let us consider the following scenario:
- cluster with nodes A through H. Nodes B and G are in DC1, the rest of the nodes in DC2.
- DC shard strategy, replication factor 3 (without loss of generality we can omit rack considerations from this discussion).

In this situation the ranges would be (node: primary range, replica ranges):

A: H-A, F-G
B: A-B, H-A
C: B-C, H-A, A-B
D: C-D, B-C
E: D-E, C-D
F: E-F, D-E
G: F-G, A-B, B-C, C-D, D-E, E-F
H: G-H, E-F, F-G

In this situation node G is most likely by far the most loaded one, so if a node bootstraps (either a new one, or a loadbalance operation), it will take half of G's range. The problem is that it will take half of G's _primary_ range, while most of G's load comes from _replicas_. After this operation, the ring would be (X denotes the new node):

A: H-A, F-G
B: A-B, H-A
C: B-C, H-A, A-B
D: C-D, B-C
E: D-E, C-D
F: E-F, D-E
X: F-X, A-B, B-C, C-D, D-E, E-F
G: X-G, F-X
H: G-H, E-F, F-X, X-G

It is clear from this that the situation has not really improved. The only difference is that X is now the most loaded node, and G probably has a very light load. If another new node arrives, it will again go in front of X, and the situation will remain largely the same.

In order to get rid of such replica sinks, nodes would also need to consider replica "categories". When we look for replica destinations, we essentially consider the categories "in other DC", "in other rack" and "anywhere". When node X boots in the ring above, it should consider not just G's primary range, but G's effective range (primary plus replicas). The number of replicas a node receives is largely determined by the other nodes that belong to the same replica category. If X belongs to DC1 (same as B and G), the best balance would be gained if X booted in the middle of B and G, as that would divide the replicas evenly. This might not always be the best place, because individual ranges might differ greatly in size.

In order to fix this, the ideal solution would be to modify the load string so that, instead of the total load, it reports both the load from the primary range and the load from replicas. This would allow the bootstrapping node to decide whether it should take half of the replicas or half of the primary range to get the optimal result. However, there is currently no way to get these two numbers separately, so we only have the total load. It would not be perfect, but perhaps for now it would be best to only consider nodes from the same category when making load balancing decisions. That is, for rack unaware we consider all nodes as always, but for the other strategies we would determine the bootstrap token based on which nodes are in the same category (see the sketch at the end of this mail). I don't know if this would work, but it should at least be better than what we have now.

Another related issue: now that we have a strategy per table, how should we approach load balancing? An optimal decision for one strategy might be bad for another strategy. If we have just one strategy in use, that's clear, but with multiple strategies we'd need to determine which one to favor. Or am I thinking about this in a completely wrong way?

-Jaakko
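P.S. To make the "same category" idea concrete, here is a rough sketch in Java. This is hypothetical code, not anything from the Cassandra tree; the names (CategoryAwareBootstrap, Node, chooseToken) and the integer tokens are all made up for illustration. It chooses a bootstrap token by looking only at nodes in the new node's own DC: it finds the most loaded node in that category and takes the midpoint between it and its predecessor within the category, so that replica load (not just the primary range) gets split.

import java.math.BigInteger;
import java.util.*;

// Hypothetical sketch only -- not actual Cassandra code.
public class CategoryAwareBootstrap {

    // One live node: its token, its datacenter ("category"), and the
    // total load it reports (the only number the load string gives us).
    record Node(BigInteger token, String dc, double load) {}

    // Pick a bootstrap token for a new node joining newNodeDc by
    // considering only nodes in the same category.
    static BigInteger chooseToken(List<Node> ring, String newNodeDc) {
        List<Node> sameDc = ring.stream()
                .filter(n -> n.dc().equals(newNodeDc))
                .sorted(Comparator.comparing(Node::token))
                .toList();
        if (sameDc.size() < 2)
            throw new IllegalStateException("need >= 2 nodes in " + newNodeDc);

        // Most loaded node in the category is the one to relieve.
        Node heaviest = Collections.max(sameDc,
                Comparator.comparingDouble(Node::load));
        int i = sameDc.indexOf(heaviest);
        Node predecessor = sameDc.get((i - 1 + sameDc.size()) % sameDc.size());

        // Naive midpoint in token space; real code must handle the
        // wrap-around case (predecessor token > heaviest token).
        return predecessor.token().add(heaviest.token()).shiftRight(1);
    }

    public static void main(String[] args) {
        // The scenario from this mail, with illustrative integer tokens:
        // B and G in DC1, the rest in DC2. G is the replica sink.
        List<Node> ring = List.of(
                new Node(BigInteger.valueOf(10), "DC2", 2.0),  // A
                new Node(BigInteger.valueOf(20), "DC1", 2.0),  // B
                new Node(BigInteger.valueOf(30), "DC2", 3.0),  // C
                new Node(BigInteger.valueOf(40), "DC2", 2.0),  // D
                new Node(BigInteger.valueOf(50), "DC2", 2.0),  // E
                new Node(BigInteger.valueOf(60), "DC2", 2.0),  // F
                new Node(BigInteger.valueOf(70), "DC1", 6.0),  // G
                new Node(BigInteger.valueOf(80), "DC2", 3.0)); // H

        System.out.println(chooseToken(ring, "DC1")); // 45: between B and G
    }
}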
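A real implementation would also need to handle the wrap-around range and weigh actual range sizes (as noted above, individual ranges can differ greatly), but even this naive midpoint rule would place X halfway between B and G in the example ring, splitting G's replica load rather than just its primary range.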