[
https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241332#comment-14241332
]
Branimir Lambov commented on CASSANDRA-7032:
--------------------------------------------
Ignoring replication for the time being (more on that below), and looking at
the best thing we can do when we have an existing setup and are trying to add a
node, I came up with the following approach.
We can only assign new vnodes, which means that we can only _take away_ load
from other nodes, never add to it. On the one hand this means that
underutilized nodes are hopeless until the cluster grows enough for their share
to become normal. On the other hand it means that the best thing to do (aiming
for the smallest overutilization, i.e. the smallest maximum deviation from the
mean) is to take the highest-load nodes and spread their load evenly among them
and the new node.
Adding a new node gives us _vn_ new tokens to issue (one per vnode), i.e. we
can decrease the load of at most _vn_ other nodes. We could pick the _vn_
highest-load ones, but some of them may already have a lower load than the
target spread; we thus select the _n_ highest-load nodes, for the largest
_n <= vn_ such that the spread load _t_, which is their combined load divided
by _n+1_, is lower than the load of each individual node. We can then choose
how to assign the _vn_ tokens, splitting some of the ranges in these _n_ nodes,
to reduce the load of each of them to _t_. This should also leave the new node
with a load of _t_.
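To make the selection of _n_ and _t_ concrete, here is a minimal standalone
sketch (the class and method names are mine, not taken from the attached
TestVNodeAllocation.java; it only illustrates how _n_ and _t_ could be computed
from the per-node loads and _vn_):
{code}
import java.util.Arrays;

// Hypothetical sketch (names are mine): compute the target spread load t for a
// new node bringing vn tokens, given the current per-node loads.
public class SpreadSelection
{
    public static double selectTarget(double[] loads, int vn)
    {
        double[] sorted = loads.clone();
        Arrays.sort(sorted);          // ascending order
        double sum = 0;               // combined load of the nodes selected so far
        double t = 0;                 // current spread load
        int n = 0;                    // number of nodes selected so far
        // Consider nodes from the highest load downwards; stop once adding the
        // next node would give a spread that is not lower than its own load.
        for (int i = sorted.length - 1; i >= 0 && n < vn; --i)
        {
            double candidate = (sum + sorted[i]) / (n + 2);  // n+1 old nodes plus the new node share the load
            if (candidate >= sorted[i])
                break;                // this node is already at or below the spread
            sum += sorted[i];
            ++n;
            t = candidate;
        }
        return t;   // the new node and each of the n selected nodes end up with load t
    }

    public static void main(String[] args)
    {
        double[] loads = { 1.24, 1.11, 1.05, 0.95, 0.83, 0.80 };
        System.out.printf("t = %.2f%n", selectTarget(loads, 4));   // prints t = 0.87
    }
}
{code}
For the example loads in main and _vn_ = 4 this selects the four highest-load
nodes and gives _t_ = 0.87.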
The attached code implements a simple version of this, which reduces
overutilization very quickly with every new node-- a typical simulation looks
like:
{code}
Random generation of 1000 nodes with 256 tokens each
Size 1000 max 1.24 min 0.80 No replication
Adding 1 node(s) using NoReplicationTokenDistributor
Size 1001 max 1.11 min 0.80 No replication
Adding 9 node(s) using NoReplicationTokenDistributor
Size 1010 max 1.05 min 0.81 No replication
Adding 30 node(s) using NoReplicationTokenDistributor
Size 1040 max 1.02 min 0.83 No replication
Adding 210 node(s) using NoReplicationTokenDistributor
Size 1250 max 1.00 min 1.00 No replication
{code}
It also does pretty well when building clusters from scratch.
However, when replication is present the load distribution of this allocation
does not look good (the added node tends to take much more than it should; one
reason for this is that it becomes a replica of the token ranges it splits),
which is not unexpected. I am now trying to see how exactly taking replication
into account affects the reasoning above. We can still only remove load, but
the way splitting affects the loads is not that clear any more.
As far as I can see the following simplification of Cassandra's replication
strategies should suffice for handling the current and planned variations:
* we have units made up of a number of vnodes whose load we want to be able to
balance (currently unit==node, but in the future the unit could be smaller (a
disk or core))
* units are bunched up in racks (if racks are not defined, a node is implicitly
a rack for its units)
* replicas of data must be placed on the closest higher vnodes that belong to
different racks
* the replication strategy specifies the number of replicas and the set of
units belonging to each rack
Datacentres are irrelevant as replication is specified within each dc, i.e. we
can isolate the vnode allocation to the individual dc. If disk/core-level
allocation is in place, the node boundaries within a rack can be ignored as
well. Is there anything I'm missing?
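To check my understanding of the placement rule, here is a toy sketch (the
types and names are mine, not from the attached code, and it ignores the case
of fewer racks than replicas): it walks the ring from a token upwards and takes
vnodes from distinct racks until the requested number of replicas is reached.
{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

// Toy illustration (names are mine) of the placement rule: replicas go on the
// closest higher vnodes that belong to different racks.
public class ReplicaWalk
{
    // ring maps each vnode token to the rack of the unit owning it.
    public static List<Long> replicaTokens(NavigableMap<Long, String> ring, long token, int replicas)
    {
        List<Long> result = new ArrayList<>();
        Set<String> seenRacks = new HashSet<>();
        // Walk the ring starting at the first vnode at or after the token, wrapping around.
        List<Map.Entry<Long, String>> walk = new ArrayList<>(ring.tailMap(token, true).entrySet());
        walk.addAll(ring.headMap(token, false).entrySet());
        for (Map.Entry<Long, String> e : walk)
        {
            if (result.size() >= replicas)
                break;
            if (seenRacks.add(e.getValue()))   // at most one replica per rack
                result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args)
    {
        NavigableMap<Long, String> ring = new TreeMap<>();
        ring.put(10L, "rack1");
        ring.put(20L, "rack1");
        ring.put(30L, "rack2");
        ring.put(40L, "rack3");
        // Replicas for token 15: vnode 20 (rack1), then 30 (rack2); 40 (rack3) only if replicas > 2.
        System.out.println(replicaTokens(ring, 15L, 2));   // prints [20, 30]
    }
}
{code}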
[~benedict]: I believe you prefer to split the disk/core workload inside the
node by assigning a token range (e.g. the vnodes that intersect with a range
corresponding to _1/n_ of the token ring are to be handled by that disk/core).
I prefer to just choose _1/n_ of the vnodes, because it lets me directly
balance them-- do you have any objections to this?
> Improve vnode allocation
> ------------------------
>
> Key: CASSANDRA-7032
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7032
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Benedict
> Assignee: Branimir Lambov
> Labels: performance, vnodes
> Fix For: 3.0
>
> Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java,
> TestVNodeAllocation.java
>
>
> It's been known for a little while that random vnode allocation causes
> hotspots of ownership. It should be possible to improve dramatically on this
> with deterministic allocation. I have quickly thrown together a simple greedy
> algorithm that allocates vnodes efficiently, and will repair hotspots in a
> randomly allocated cluster gradually as more nodes are added, and also
> ensures that token ranges are fairly evenly spread between nodes (somewhat
> tunably so). The allocation still permits slight discrepancies in ownership,
> but it is bound by the inverse of the size of the cluster (as opposed to
> random allocation, which strangely gets worse as the cluster size increases).
> I'm sure there is a decent dynamic programming solution to this that would be
> even better.
> If on joining the ring a new node were to CAS a shared table where a
> canonical allocation of token ranges lives after running this (or a similar)
> algorithm, we could then get guaranteed bounds on the ownership distribution
> in a cluster. This will also help for CASSANDRA-6696.