On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton <s...@acunu.com> wrote:

> Hello cassandra-dev,
>
> This is a long email. It concerns a significant change to Cassandra, so
> deserves a thorough introduction.
>
> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!
>
> Cassandra's current partitioning scheme is sub-optimal for bootstrap,
> decommission, repair and re-balance operations, and places the burden on
> users to properly calculate tokens (a common cause of mistakes), which is a
> recurring pain-point.
>
> Virtual nodes have a variety of benefits over the one-to-one mapping of
> host to key range which Cassandra currently supports.
>
> Among these benefits are:
>
> * Even load balancing when growing and shrinking the cluster
> A virtual node scheme ensures that all hosts in a cluster have an even
> portion of the total data, and a new node bootstrapped into the cluster
> will assume its share of the data. Doubling or halving the cluster to
> ensure even load distribution would no longer be necessary.
>
> * Distributed rebuild
> When sizing a cluster, one of the considerations is the amount of time
> required to recover from a failed node. This is the exposure time, during
> which a secondary failure could cause data loss. In order to guarantee an
> upper bound on the exposure time, the amount of data which can be stored on
> each host is limited by the amount of time taken to recover the required
> replica count. At Acunu we have found that the exposure time is frequently
> the limiting factor which dictates the maximum allowed node size in
> customers' clusters.
>
> Using a virtual node scheme, the data stored on one host is not replicated
> on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
> other virtual nodes which may be on a different set of physical hosts to
> replicas of other virtual nodes stored on the same host. This means data
> for one host is replicated evenly across the entire cluster.
>
> In the event of a failure, restoring the replica count can then be done in
> a fully distributed way. Each host in the cluster participates in the
> rebuild, drastically reducing the exposure time, allowing more data to be
> stored on a single host while still maintaining an acceptable upper bound
> on the likelihood of secondary failure. This reduces TCO concerns.
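>
> To make this concrete, here is a back-of-envelope sketch (toy Python; the
> node size, streaming rate and cluster size are assumptions for
> illustration, not measurements):
>
>     failed_node_data_gb = 1000.0   # data held by the failed host (assumed)
>     per_stream_mb_s = 50.0         # assumed sustained rate per streaming source
>     cluster_size = 30
>     rf = 3
>
>     def rebuild_hours(data_gb, parallel_streams):
>         gb_per_s = parallel_streams * per_stream_mb_s / 1024.0
>         return data_gb / gb_per_s / 3600.0
>
>     # One token per host: only the RF-1 hosts holding replicas of the failed
>     # host's range can act as streaming sources.
>     print("single-token rebuild: %.1f h"
>           % rebuild_hours(failed_node_data_gb, rf - 1))
>
>     # Virtual nodes: the failed host's data is replicated across the whole
>     # cluster, so every remaining host can stream a small slice in parallel.
>     print("vnode rebuild:        %.1f h"
>           % rebuild_hours(failed_node_data_gb, cluster_size - 1))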
>
> * Greater failure tolerance in streaming
> Operations which require streaming of a large range of data, e.g. bootstrap,
> decommission, repair, etc., incur a heavy cost if an error (e.g. a dropped
> network connection) is encountered during the streaming. Currently the
> whole range must be re-streamed, and this could constitute a very large
> amount of data. Virtual nodes reduce the impact of streaming failures,
> since each virtual node is a much smaller range of the key-space, so
> re-streaming a whole virtual node is a much cheaper process.
>
> * Evenly distributed impact of streaming operations
> Streaming operations such as bootstrap, repair, and so on would involve every
> node in the cluster. This would distribute the load of these operations
> across the whole cluster, and could be staggered so that only a small
> subset of nodes were affected at once, similar to staggered repair[1].
>
> * Possibility for active load balancing
> Load balancing in Cassandra currently involves moving a token to
> increase/reduce the amount of key-space for which a host is responsible.
> This only allows load balancing between neighbouring nodes, so it could
> involve moving more than one token just to redistribute a single overloaded
> node. Virtual nodes could allow load balancing on a much finer granularity,
> so heavily loaded portions of the key-space could be redistributed to
> lighter-loaded hosts by reassigning one or more virtual nodes.
>
>
> Implementing a virtual node scheme in Cassandra is not an insignificant
> amount of work, and it will touch a large amount of the codebase related to
> partitioning, placement, routing, gossip, and so on. We do believe that
> this is possible to do incrementally, and in such a way that there is an
> easy upgrade path for pre-virtual-node deployments.
>
> It would not however touch the storage layer. The virtual node concept is
> solely for partitioning and placement, not for segregating the data storage
> of the host, so all keys for all virtual nodes on a host would be stored in
> the same SSTables.
>
> We are not proposing the adoption of the same scheme used by Voldemort[2]
> and described in the Dynamo paper[3]. We feel this scheme is too different
> from Cassandra's current distribution model to be a viable target for
> incremental development. Their scheme also fixes the number of virtual
> nodes for the lifetime of the cluster, which can prove to be a ceiling to
> scaling the cluster if the virtual nodes grow too large.
>
> The proposed design is:
> * Assign each host T random tokens.
>

Will it work well for OrderedPartitioner?

For load-balancing purposes, would it be possible to detect the access
pattern of a range and move that range dynamically, instead of introducing
virtual nodes?



> * A partition is assigned to a host for each of its tokens, where the
> partition is defined by the interval between a token and the previous token
> on the ring.
> * When a host joins the ring it is assigned T random tokens which will
> result in a portion of an existing partition being assigned to that host.
> * When a host leaves the ring it relinquishes its tokens which will result
> in its partitions becoming part of the neighbouring partitions.
>
> This is just a basic extension of Cassandra's existing distribution model,
> where instead of having 1 token per host, there are many tokens per host.
> It is the same scheme used by libketama[4] for consistent hashing among
> memcached instances, and is also the original scheme used by Dynamo as
> described in [3] before they migrated to their current scheme with fixed
> partitions.
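>
> As a rough illustration of the scheme (toy Python, not actual Cassandra
> code; the class and method names are invented for this example), token
> assignment and key lookup would work along these lines:
>
>     import bisect
>     import hashlib
>     import random
>
>     class Ring:
>         """Each host owns T random tokens; each token owns the interval
>         back to the previous token on the ring."""
>
>         def __init__(self, tokens_per_host=256):
>             self.tokens_per_host = tokens_per_host
>             self.tokens = []              # sorted list of (token, host)
>
>         def add_host(self, host):
>             # A joining host takes T random tokens, splitting existing
>             # partitions.
>             for _ in range(self.tokens_per_host):
>                 bisect.insort(self.tokens, (random.getrandbits(64), host))
>
>         def remove_host(self, host):
>             # A leaving host relinquishes its tokens; its partitions are
>             # absorbed into the neighbouring partitions.
>             self.tokens = [t for t in self.tokens if t[1] != host]
>
>         def owner_of(self, key):
>             # A key belongs to the next token clockwise from its hash.
>             h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], 'big')
>             i = bisect.bisect_left(self.tokens, (h,))
>             return self.tokens[i % len(self.tokens)][1]
>
> For example, building a three-host ring and looking up a key:
>
>     ring = Ring(tokens_per_host=256)
>     for h in ("host1", "host2", "host3"):
>         ring.add_host(h)
>     print(ring.owner_of("some-row-key"))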
>
> The random assignment of tokens may seem unintuitive given that currently
> in Cassandra a random token assignment leads to an unbalanced cluster. With
> many virtual nodes, a random token assignment leads to load being evenly
> balanced across the hosts in the cluster with high probability. As the
> number of virtual nodes is increased, the variance in load across hosts
> decreases, as demonstrated by simulation in [5].
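>
> A toy version of that simulation (not the one we will attach to the JIRA
> ticket) shows the spread in per-host ownership shrinking as the number of
> tokens per host grows:
>
>     import random
>     import statistics
>
>     RING = 2 ** 64
>
>     def host_loads(num_hosts, tokens_per_host):
>         tokens = sorted((random.getrandbits(64), h)
>                         for h in range(num_hosts)
>                         for _ in range(tokens_per_host))
>         load = [0] * num_hosts
>         prev = tokens[-1][0] - RING       # wrap the first interval around
>         for token, host in tokens:
>             load[host] += token - prev
>             prev = token
>         return [l / RING for l in load]   # fraction of the ring per host
>
>     for t in (1, 16, 256):
>         loads = host_loads(num_hosts=100, tokens_per_host=t)
>         print("T=%3d  max/mean=%.2f  stdev=%.5f"
>               % (t, max(loads) * 100, statistics.pstdev(loads)))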
>
> This scheme has the following properties (where N is the number of hosts
> and B is the total data stored in the cluster):
> * placement metadata size is O(N) which is the same as in Cassandra
> currently
> * partition size is O(B/N), so if individual partitions become too large as
> data is inserted, adding nodes to the cluster reduces their size.
> * the strategy shares the following properties with Cassandra currently:
> ** tokens are randomly assigned
> ** partitioning is determined by placement (and vice-versa)
> ** no two nodes may share the same token
> ** when a node leaves the ring, all of its tokens are removed - there is no
> exchanging of partitions between nodes
>
> One design concern is ensuring that replicas of a key range are not stored
> on the same physical host, as failure of that host could cause the loss of
> more than one replica of the data. This will be achieved by using a
> placement strategy very similar to the existing NetworkTopologyStrategy,
> which treats each individual host the same way as NTS treats a rack - that
> is, just as NTS never assigns two replicas to the same rack, this strategy
> never assigns two replicas to the same host.
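>
> As an illustrative sketch (again toy Python rather than real Cassandra
> code), replica selection under such a strategy would walk the ring and skip
> virtual nodes whose host already holds a replica, exactly as NTS skips
> racks:
>
>     import bisect
>
>     def replicas_for(sorted_tokens, key_hash, rf):
>         """sorted_tokens: sorted list of (token, host) pairs.
>         Returns rf distinct physical hosts for the given key hash."""
>         start = bisect.bisect_left(sorted_tokens, (key_hash,))
>         chosen, seen = [], set()
>         for i in range(len(sorted_tokens)):
>             token, host = sorted_tokens[(start + i) % len(sorted_tokens)]
>             if host not in seen:      # at most one replica per physical host
>                 seen.add(host)
>                 chosen.append(host)
>                 if len(chosen) == rf:
>                     break
>         return chosen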
>
> I will shortly create a ticket in JIRA to track discussion of this design.
> We have also done some simulation of this scheme to observe the load
> balancing properties, node size distribution, cluster resizing and so on. I
> will attach some results of this simulation to the JIRA ticket in due
> course.
>
> We are keen to get the ball rolling on this and we look forward to your
> input, ideas and recommendations.
>
> Best Regards,
>
> Sam Overton
>
> [1] Staggering repair:
> https://issues.apache.org/jira/browse/CASSANDRA-3721
> [2] Project Voldemort, Design: http://project-voldemort.com/design.php
> [3] Dynamo:
> http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
> [4] Ketama: Consistent Hashing:
> http://www.audioscrobbler.net/development/ketama/
> [5] Consistent Hashing:
> http://www.lexemetech.com/2007/11/consistent-hashing.html
>
> --
> Sam Overton
> Acunu | http://www.acunu.com | @acunu
>
