[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241332#comment-14241332 ]

Branimir Lambov commented on CASSANDRA-7032:
--------------------------------------------

Ignoring replication for the time being (more on that below), and looking at 
the best we can do when adding a node to an existing setup, I came up with the 
following approach.

We can only assign new vnodes, which means that we can only _take away_ load 
from other nodes, never add to it. On the one hand this means that 
underutilized nodes are hopeless until the cluster grows enough for their share 
to become normal. On the other hand, it means that the best thing to do (aiming 
for the smallest overutilization, i.e. the smallest maximum deviation from the 
mean) is to take the highest-load nodes and spread their load evenly between 
them and the new node.

Adding a new node gives us _vn_ (its number of vnodes) new tokens to issue, 
i.e. we can decrease the load of at most _vn_ other nodes. We could pick the 
_vn_ highest-load ones, but some of them may already have a lower load than the 
target spread; we therefore select the largest _n <= vn_ such that the spread 
load _t_, defined as the combined load of the _n_ highest-load nodes divided by 
_n+1_, is lower than the load of each of those nodes. We can then assign the 
_vn_ tokens so that they split ranges in these _n_ nodes and reduce the load of 
each of them to _t_. This should also leave the new node with a load of _t_.

The attached code implements a simple version of this, which reduces 
overutilization very quickly with every new node; a typical simulation looks 
like:
{code}
Random generation of 1000 nodes with 256 tokens each
Size 1000   max 1.24 min 0.80   No replication
Adding 1 node(s) using NoReplicationTokenDistributor
Size 1001   max 1.11 min 0.80   No replication
Adding 9 node(s) using NoReplicationTokenDistributor
Size 1010   max 1.05 min 0.81   No replication
Adding 30 node(s) using NoReplicationTokenDistributor
Size 1040   max 1.02 min 0.83   No replication
Adding 210 node(s) using NoReplicationTokenDistributor
Size 1250   max 1.00 min 1.00   No replication
{code}
It also does a good job of constructing clusters from scratch.
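
For reference, a minimal sketch of how such a max/min summary can be computed, 
assuming the figures are each node's owned fraction of the ring relative to 
the ideal equal share (names invented; the attached TestVNodeAllocation.java 
is the authoritative source):
{code}
import java.util.Map;

class OwnershipStats
{
    // summarize per-node ownership as max/min ratios against the ideal share 1/numNodes
    static String summarize(Map<String, Double> ownedFraction)
    {
        double ideal = 1.0 / ownedFraction.size();
        double max = Double.NEGATIVE_INFINITY, min = Double.POSITIVE_INFINITY;
        for (double f : ownedFraction.values())
        {
            max = Math.max(max, f / ideal);
            min = Math.min(min, f / ideal);
        }
        return String.format("Size %d   max %.2f min %.2f", ownedFraction.size(), max, min);
    }
}
{code}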

However, when replication is present the load distribution of this allocation 
does not look good (the added node tends to take much more than it should; one 
reason for this is that it becomes a replica of the token ranges it splits), 
which is not unexpected. I am now trying to see how exactly taking replication 
into account affects the reasoning above. We can still only remove load, but 
the way splitting affects the loads is not that clear any more.

As far as I can see, the following simplification of Cassandra's replication 
strategies should suffice for handling the current and planned variations:
* we have units made up of a number of vnodes whose load we want to be able to 
balance (currently unit==node, but in the future the unit could be smaller (a 
disk or core))
* units are bunched up in racks (if racks are not defined, a node is implicitly 
a rack for its units)
* replicas of data must be placed on the closest higher vnodes that belong to 
different racks (see the sketch after this list)
* the replication strategy specifies the number of replicas and the set of 
units belonging to each rack
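
A minimal sketch of this model in Java (the Unit class, rack strings and ring 
map are assumptions made for illustration, not Cassandra's actual classes):
{code}
import java.util.*;

class SimpleReplicationModel
{
    static class Unit { final String rack; Unit(String rack) { this.rack = rack; } }

    final NavigableMap<Long, Unit> ring = new TreeMap<>();   // token -> owning unit, within one dc
    final int replicas;

    SimpleReplicationModel(int replicas) { this.replicas = replicas; }

    void addVnode(long token, Unit owner) { ring.put(token, owner); }

    /** Replicas go to the closest higher vnodes that belong to distinct racks. */
    List<Unit> replicasFor(long token)
    {
        List<Unit> result = new ArrayList<>();
        Set<String> racksUsed = new HashSet<>();
        // walk the ring clockwise starting from the first vnode at or above the token
        List<Unit> clockwise = new ArrayList<>(ring.tailMap(token, true).values());
        clockwise.addAll(ring.headMap(token, false).values());
        for (Unit u : clockwise)
        {
            if (result.size() >= replicas)
                break;
            if (racksUsed.add(u.rack))   // at most one replica per rack
                result.add(u);
        }
        return result;
    }
}
{code}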

Datacentres are irrelevant as replication is specified within each dc, i.e. we 
can isolate the vnode allocation to the individual dc. If disk/core-level 
allocation is in place, the node boundaries within a rack can be ignored as 
well. Is there anything I'm missing?


[~benedict]: I believe you prefer to split the disk/core workload inside the 
node by assigning a token range (e.g. the vnodes that intersect with a range 
corresponding to _1/n_ of the token ring are to be handled by that disk/core). 
I prefer to just choose _1/n_ of the vnodes, because it lets me directly 
balance them-- do you have any objections to this?
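
To make the contrast concrete, a toy sketch of the two options (all names 
invented for illustration):
{code}
import java.util.*;

class DiskSplit
{
    // Range-based option: a disk/core handles the vnodes whose tokens fall into
    // its 1/n slice of the token ring.
    static int diskByRange(long token, int nDisks)
    {
        double fraction = (token - (double) Long.MIN_VALUE) / Math.pow(2, 64);
        return Math.min(nDisks - 1, (int) (fraction * nDisks));   // clamp rounding at the top edge
    }

    // Count-based option (preferred above): deal the node's vnodes out to the
    // disks/cores directly, so each one's load can be balanced explicitly.
    static Map<Integer, List<Long>> diskByCount(List<Long> vnodeTokens, int nDisks)
    {
        Map<Integer, List<Long>> assignment = new HashMap<>();
        for (int i = 0; i < vnodeTokens.size(); ++i)
            assignment.computeIfAbsent(i % nDisks, d -> new ArrayList<>()).add(vnodeTokens.get(i));
        return assignment;
    }
}
{code}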


> Improve vnode allocation
> ------------------------
>
>                 Key: CASSANDRA-7032
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Branimir Lambov
>              Labels: performance, vnodes
>             Fix For: 3.0
>
>         Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, 
> TestVNodeAllocation.java
>
>
> It's been known for a little while that random vnode allocation causes 
> hotspots of ownership. It should be possible to improve dramatically on this 
> with deterministic allocation. I have quickly thrown together a simple greedy 
> algorithm that allocates vnodes efficiently, and will repair hotspots in a 
> randomly allocated cluster gradually as more nodes are added, and also 
> ensures that token ranges are fairly evenly spread between nodes (somewhat 
> tunably so). The allocation still permits slight discrepancies in ownership, 
> but it is bound by the inverse of the size of the cluster (as opposed to 
> random allocation, which strangely gets worse as the cluster size increases). 
> I'm sure there is a decent dynamic programming solution to this that would be 
> even better.
> If on joining the ring a new node were to CAS a shared table where a 
> canonical allocation of token ranges lives after running this (or a similar) 
> algorithm, we could then get guaranteed bounds on the ownership distribution 
> in a cluster. This will also help for CASSANDRA-6696.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
