Hello cassandra-dev, This is a long email. It concerns a significant change to Cassandra, so deserves a thorough introduction.
*The summary is*: we believe virtual nodes are the way forward. We would like to add virtual nodes to Cassandra and we are asking for comments, criticism and collaboration! Cassandra's current partitioning scheme is sub-optimal for bootstrap, decommission, repair and re-balance operations, and places the burden on users to properly calculate tokens (a common cause of mistakes), which is a recurring pain-point. Virtual nodes have a variety of benefits over the one-to-one mapping of host to key range which Cassandra currently supports. Among these benefits are: * Even load balancing when growing and shrinking the cluster A virtual node scheme ensures that all hosts in a cluster have an even portion of the total data, and a new node bootstrapped into the cluster will assume its share of the data. Doubling, or halving the cluster to ensure even load distribution would no longer be necessary. * Distributed rebuild When sizing a cluster, one of the considerations is the amount of time required to recover from a failed node. This is the exposure time, during which a secondary failure could cause data loss. In order to guarantee an upper bound on the exposure time, the amount of data which can be stored on each host is limited by the amount of time taken to recover the required replica count. At Acunu we have found that the exposure time is frequently the limiting factor which dictates the maximum allowed node size in customers' clusters. Using a virtual node scheme, the data stored on one host is not replicated on just RF-1 other physical hosts. Each virtual node is replicated to RF-1 other virtual nodes which may be on a different set of physical hosts to replicas of other virtual nodes stored on the same host. This means data for one host is replicated evenly across the entire cluster. In the event of a failure then, restoring the replica count can be done in a fully distributed way. Each host in the cluster participates in the rebuild, drastically reducing the exposure time, allowing more data to be stored on a single host while still maintaining an acceptable upper bound on the likelihood of secondary failure. This reduces TCO concerns. * Greater failure tolerance in streaming Operations which require streaming of a large range of data, eg. bootstrap, decommission, repair, etc. incur a heavy cost if an error (eg. dropped network connection) is encountered during the streaming. Currently the whole range must be re-streamed, and this could constitute a very large amount of data. Virtual nodes reduce the impact of streaming failures, since each virtual node is a much smaller range of the key-space, so re-streaming a whole virtual node is a much cheaper process. * Evenly distributed impact of streaming operations Streaming operations such as bootstrap, repair, et al. would involve every node in the cluster. This would distribute the load of these operations across the whole cluster, and could be staggered so that only a small subset of nodes were affected at once, similar to staggered repair. * Possibility for active load balancing Load balancing in Cassandra currently involves moving a token to increase/reduce the amount of key-space for which a host is responsible. This only allows load balancing between neighbouring nodes, so it could involve moving more than one token just to redistribute a single overloaded node. Virtual nodes could allow load balancing on a much finer granularity, so heavily loaded portions of the key-space could be redistributed to lighter-loaded hosts by reassigning one or more virtual nodes. Implementing a virtual node scheme in Cassandra is not an insignificant amount of work, and it will touch a large amount of the codebase related to partitioning, placement, routing, gossip, and so on. We do believe that this is possible to do incrementally, and in such a way that there is an easy upgrade path for pre-virtual-node deployments. It would not however touch the storage layer. The virtual node concept is solely for partitioning and placement, not for segregating the data storage of the host, so all keys for all virtual nodes on a host would be stored in the same SSTables. We are not proposing the adoption of the same scheme used by Voldemort and described in the Dynamo paper. We feel this scheme is too different from Cassandra's current distribution model to be a viable target for incremental development. Their scheme also fixes the number of virtual nodes for the lifetime of the cluster, which can prove to be a ceiling to scaling the cluster if the virtual nodes grow too large. The proposed design is: * Assign each host T random tokens. * A partition is assigned to a host for each of its tokens, where the partition is defined by the interval between a token and the previous token on the ring. * When a host joins the ring it is assigned T random tokens which will result in a portion of an existing partition being assigned to that host. * When a host leaves the ring it relinquishes its tokens which will result in its partitions becoming part of the neighbouring partitions. This is just a basic extension of Cassandra's existing distribution model, where instead of having 1 token per host, there are many tokens per host. It is the same scheme used by libketama for consistent hashing among memcached instances, and is also the original scheme used by Dynamo as described in  before they migrated to their current scheme with fixed partitions. The random assignment of tokens may seem unintuitive given that currently in Cassandra a random token assigment leads to an unbalanced cluster. With many virtual nodes, a random token assignment leads to load being evenly balanced across the hosts in the cluster with high probability. As the number of virtual nodes is increased, the variance in load across hosts decreases, as demonstrated by simulation in . This scheme has the following properties - (where N is the number of hosts and B is the total data stored in the cluster): * placement metadata size is O(N) which is the same as in Cassandra currently * partition size is O(B/N) so as data is inserted, if individual partitions become too large then adding nodes to the cluster reduces this. * the strategy shares the following properties in common with Cassandra currently ** tokens are randomly assigned ** partitioning is determined by placement (and vice-versa) ** no two nodes may share the same token ** when a node leaves the ring, all of its tokens are removed - there is no exchanging of partitions between nodes One design concern is that replicas of a key range are not stored on the same physical host, as failure of that host could cause the loss of more than one replica of the data. This will be achieved by using a placement strategy very similar the the existing NetworkTopologyStrategy, which treats each individual host the same way as NTS treats a rack - that is replicas are not assigned to two hosts on the same rack. I will shortly create a ticket in JIRA to track discussion of this design. We have also done some simulation of this scheme to observe the load balancing properties, node size distribution, cluster resizing and so on. I will attach some results of this simulation to the JIRA ticket in due course. We are keen to get the ball rolling on this and we look forward to your input, ideas and recommendations. Best Regards, Sam Overton  Staggering repair: https://issues.apache.org/jira/browse/CASSANDRA-3721  Project Voldemort, Design: http://project-voldemort.com/design.php  Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf  Ketama: Consistent Hashing: http://www.audioscrobbler.net/development/ketama/  Consistent Hashing: http://www.lexemetech.com/2007/11/consistent-hashing.html -- Sam Overton Acunu | http://www.acunu.com | @acunu