A friend pointed out to me privately that I came across pretty harsh in this thread. While I stand by my technical concerns, I do want to acknowledge that Sam's proposal here indicates a strong grasp of the principles involved, and deeper thought about the issues than I think anyone else has brought to date. Thanks for putting that energy into it, Sam, and I look forward to seeing how you approach the implementation.
On Fri, Mar 16, 2012 at 6:38 PM, Sam Overton <s...@acunu.com> wrote:
> Hello cassandra-dev,
>
> This is a long email. It concerns a significant change to Cassandra, so deserves a thorough introduction.
>
> *The summary is*: we believe virtual nodes are the way forward. We would like to add virtual nodes to Cassandra and we are asking for comments, criticism and collaboration!
>
> Cassandra's current partitioning scheme is sub-optimal for bootstrap, decommission, repair and re-balance operations, and places the burden on users to properly calculate tokens (a common cause of mistakes), which is a recurring pain-point.
>
> Virtual nodes have a variety of benefits over the one-to-one mapping of host to key range which Cassandra currently supports.
>
> Among these benefits are:
>
> * Even load balancing when growing and shrinking the cluster
> A virtual node scheme ensures that all hosts in a cluster have an even portion of the total data, and a new node bootstrapped into the cluster will assume its share of the data. Doubling, or halving the cluster to ensure even load distribution would no longer be necessary.
>
> * Distributed rebuild
> When sizing a cluster, one of the considerations is the amount of time required to recover from a failed node. This is the exposure time, during which a secondary failure could cause data loss. In order to guarantee an upper bound on the exposure time, the amount of data which can be stored on each host is limited by the amount of time taken to recover the required replica count. At Acunu we have found that the exposure time is frequently the limiting factor which dictates the maximum allowed node size in customers' clusters.
>
> Using a virtual node scheme, the data stored on one host is not replicated on just RF-1 other physical hosts.
> Each virtual node is replicated to RF-1 other virtual nodes which may be on a different set of physical hosts to replicas of other virtual nodes stored on the same host. This means data for one host is replicated evenly across the entire cluster.
>
> In the event of a failure then, restoring the replica count can be done in a fully distributed way. Each host in the cluster participates in the rebuild, drastically reducing the exposure time, allowing more data to be stored on a single host while still maintaining an acceptable upper bound on the likelihood of secondary failure. This reduces TCO concerns.
>
> * Greater failure tolerance in streaming
> Operations which require streaming of a large range of data, e.g. bootstrap, decommission, repair, etc. incur a heavy cost if an error (e.g. a dropped network connection) is encountered during the streaming. Currently the whole range must be re-streamed, and this could constitute a very large amount of data. Virtual nodes reduce the impact of streaming failures, since each virtual node is a much smaller range of the key-space, so re-streaming a whole virtual node is a much cheaper process.
>
> * Evenly distributed impact of streaming operations
> Streaming operations such as bootstrap, repair, et al. would involve every node in the cluster. This would distribute the load of these operations across the whole cluster, and could be staggered so that only a small subset of nodes were affected at once, similar to staggered repair[1].
>
> * Possibility for active load balancing
> Load balancing in Cassandra currently involves moving a token to increase/reduce the amount of key-space for which a host is responsible. This only allows load balancing between neighbouring nodes, so it could involve moving more than one token just to redistribute a single overloaded node.
> Virtual nodes could allow load balancing on a much finer granularity, so heavily loaded portions of the key-space could be redistributed to lighter-loaded hosts by reassigning one or more virtual nodes.
>
> Implementing a virtual node scheme in Cassandra is not an insignificant amount of work, and it will touch a large amount of the codebase related to partitioning, placement, routing, gossip, and so on. We do believe that this is possible to do incrementally, and in such a way that there is an easy upgrade path for pre-virtual-node deployments.
>
> It would not however touch the storage layer. The virtual node concept is solely for partitioning and placement, not for segregating the data storage of the host, so all keys for all virtual nodes on a host would be stored in the same SSTables.
>
> We are not proposing the adoption of the same scheme used by Voldemort[2] and described in the Dynamo paper[3]. We feel this scheme is too different from Cassandra's current distribution model to be a viable target for incremental development. Their scheme also fixes the number of virtual nodes for the lifetime of the cluster, which can prove to be a ceiling to scaling the cluster if the virtual nodes grow too large.
>
> The proposed design is:
> * Assign each host T random tokens.
> * A partition is assigned to a host for each of its tokens, where the partition is defined by the interval between a token and the previous token on the ring.
> * When a host joins the ring it is assigned T random tokens, which will result in a portion of an existing partition being assigned to that host.
> * When a host leaves the ring it relinquishes its tokens, which will result in its partitions becoming part of the neighbouring partitions.
>
> This is just a basic extension of Cassandra's existing distribution model, where instead of having 1 token per host, there are many tokens per host.
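The proposed design quoted above is simple enough to sketch in a few lines. The following is a toy illustration of the token-ring mechanics only (the names `Ring`, `add_host`, `remove_host` and `owner_of` are invented here, and MD5 stands in for Cassandra's partitioner) — not the actual implementation:

```python
import bisect
import hashlib
import random

def token_for(key: bytes) -> int:
    """Hash a key onto the ring (toy stand-in for Cassandra's partitioner)."""
    return int(hashlib.md5(key).hexdigest(), 16)

class Ring:
    """Toy token ring with T random tokens per host (hypothetical names;
    not the actual Cassandra implementation)."""

    def __init__(self, tokens_per_host: int = 256):
        self.T = tokens_per_host
        self.tokens = []   # sorted token positions
        self.owner = {}    # token -> host; no two hosts may share a token

    def add_host(self, host: str) -> None:
        # A joining host takes T random tokens, carving a slice out of
        # whichever existing partitions those tokens land in.
        for _ in range(self.T):
            t = random.getrandbits(128)
            while t in self.owner:
                t = random.getrandbits(128)
            bisect.insort(self.tokens, t)
            self.owner[t] = host

    def remove_host(self, host: str) -> None:
        # A leaving host simply relinquishes its tokens; the freed ranges
        # are absorbed by the neighbouring partitions.
        self.owner = {t: h for t, h in self.owner.items() if h != host}
        self.tokens = sorted(self.owner)

    def owner_of(self, key: bytes) -> str:
        # A partition is the interval between a token and the previous
        # token, so a key belongs to the first token at or after its hash
        # (wrapping around the ring).
        i = bisect.bisect_left(self.tokens, token_for(key)) % len(self.tokens)
        return self.owner[self.tokens[i]]
```

Note the minimal-movement property this buys: removing a host reassigns only the ranges that host owned; every other key keeps its owner, which is what makes decommission cheap.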
> It is the same scheme used by libketama[4] for consistent hashing among memcached instances, and is also the original scheme used by Dynamo as described in [3] before they migrated to their current scheme with fixed partitions.
>
> The random assignment of tokens may seem unintuitive given that currently in Cassandra a random token assignment leads to an unbalanced cluster. With many virtual nodes, a random token assignment leads to load being evenly balanced across the hosts in the cluster with high probability. As the number of virtual nodes is increased, the variance in load across hosts decreases, as demonstrated by simulation in [5].
>
> This scheme has the following properties (where N is the number of hosts and B is the total data stored in the cluster):
> * placement metadata size is O(N), which is the same as in Cassandra currently
> * partition size is O(B/N), so as data is inserted, if individual partitions become too large then adding nodes to the cluster reduces this
> * the strategy shares the following properties in common with Cassandra currently:
> ** tokens are randomly assigned
> ** partitioning is determined by placement (and vice-versa)
> ** no two nodes may share the same token
> ** when a node leaves the ring, all of its tokens are removed - there is no exchanging of partitions between nodes
>
> One design concern is ensuring that replicas of a key range are not stored on the same physical host, as failure of that host could cause the loss of more than one replica of the data. This will be achieved by using a placement strategy very similar to the existing NetworkTopologyStrategy, which treats each individual host the same way as NTS treats a rack - that is, replicas are not assigned to two hosts on the same rack.
>
> I will shortly create a ticket in JIRA to track discussion of this design.
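The claim quoted above - that variance in load shrinks as the number of virtual nodes grows - is easy to sanity-check with a few lines of Monte-Carlo simulation. This is a crude stand-alone sketch (the function `load_spread` is invented for illustration; it is not the simulation referenced in [5]):

```python
import random

def load_spread(num_hosts: int, tokens_per_host: int, seed: int = 0) -> float:
    """Max/min share of the ring across hosts for one random token
    assignment -- a crude imbalance measure (invented for this sketch)."""
    rng = random.Random(seed)
    # Place tokens_per_host random tokens per host on the unit circle.
    tokens = sorted((rng.random(), h)
                    for h in range(num_hosts)
                    for _ in range(tokens_per_host))
    share = [0.0] * num_hosts
    prev = 0.0
    for pos, host in tokens:
        share[host] += pos - prev   # each token owns the range back to
        prev = pos                  # the previous token
    share[tokens[0][1]] += 1.0 - prev  # wrap-around range closes at the
                                       # first token
    return max(share) / min(share)
```

In runs of this sketch I would expect 10 hosts with a single token each to show a max/min ownership ratio well above 2, while a few hundred tokens per host brings the ratio close to 1 - consistent with the variance argument above.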
> We have also done some simulation of this scheme to observe the load balancing properties, node size distribution, cluster resizing and so on. I will attach some results of this simulation to the JIRA ticket in due course.
>
> We are keen to get the ball rolling on this and we look forward to your input, ideas and recommendations.
>
> Best Regards,
>
> Sam Overton
>
> [1] Staggering repair: https://issues.apache.org/jira/browse/CASSANDRA-3721
> [2] Project Voldemort, Design: http://project-voldemort.com/design.php
> [3] Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
> [4] Ketama: Consistent Hashing: http://www.audioscrobbler.net/development/ketama/
> [5] Consistent Hashing: http://www.lexemetech.com/2007/11/consistent-hashing.html
>
> --
> Sam Overton
> Acunu | http://www.acunu.com | @acunu

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com