So taking a step back, if we want "vnodes", why can't we just give every node 100 tokens instead of only one? It seems to me this would have less impact on the rest of the code: it would just look like you had a 500-node cluster instead of a 5-node cluster. Your replication strategy would have to know about the physical machines so that data gets replicated correctly, but there is already some concept of this in the data-center-aware and rack-aware strategies.
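To make that concrete, here is a rough sketch of what I mean. All the names here are made up for illustration, not actual Cassandra classes:

```java
import java.math.BigInteger;
import java.util.*;

// Hypothetical sketch of the "many tokens per node" idea: the ring is a
// sorted map of token -> physical host, and each host claims many tokens.
public class MultiTokenRing {
    private final NavigableMap<BigInteger, String> ring = new TreeMap<>();
    private final Random random = new Random();

    // Give one physical host `tokensPerNode` random tokens on the ring.
    public void addNode(String host, int tokensPerNode) {
        for (int i = 0; i < tokensPerNode; i++) {
            ring.put(new BigInteger(127, random), host);
        }
    }

    // Walk the ring clockwise from the key's token, collecting replicas.
    // Skipping hosts we've already chosen is what keeps all replicas of a
    // range on distinct physical machines, even though one machine owns
    // many tokens scattered around the ring.
    public List<String> replicasFor(BigInteger keyToken, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        Set<String> seenHosts = new HashSet<>();
        for (BigInteger t : iterateFrom(keyToken)) {
            String host = ring.get(t);
            if (seenHosts.add(host)) {
                replicas.add(host);
                if (replicas.size() == replicationFactor) break;
            }
        }
        return replicas;
    }

    // Tokens >= keyToken first, then wrap around to the start of the ring.
    private Iterable<BigInteger> iterateFrom(BigInteger token) {
        List<BigInteger> order = new ArrayList<>(ring.tailMap(token, true).keySet());
        order.addAll(ring.headMap(token, false).keySet());
        return order;
    }
}
```

With 100 tokens per host and RF=3, a 5-machine cluster would look like a 500-entry ring, but replicasFor() would still place each range's replicas on 3 distinct machines.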
From what I see, I think you could get most of the benefits of vnodes by implementing a new placement strategy that did something like this, and you wouldn't have to touch (and maybe break) code in other places. Am I crazy? Naive? Once you had this set up, you could start to implement the vnode-like stuff on top of it: bootstrapping nodes in one token at a time, taking tokens from across the whole cluster rather than just your neighbor, etc., etc.

-Jeremiah Jordan

________________________________________
From: Rick Branson [rbran...@datastax.com]
Sent: Monday, March 19, 2012 5:16 PM
To: dev@cassandra.apache.org
Subject: Re: RFC: Cassandra Virtual Nodes

I think if we could go back and rebuild Cassandra from scratch, vnodes would likely be implemented from the beginning. However, I'm concerned that implementing them now could be a big distraction from more productive uses of all of our time and introduce major potential stability issues into what is becoming a business-critical piece of infrastructure for many people.

However, instead of just complaining and pedantry, I'd like to offer a feasible alternative: has there been consideration given to the idea of supporting a single token range for a node? While not theoretically as capable as vnodes, it seems more practical to me, as it would have a significantly lower impact on the codebase and provides a much clearer migration path. It also seems to solve a majority of complaints regarding operational issues with Cassandra clusters.

Each node would have a lower and an upper token, which together form a range that would be actively distributed via gossip. Read and replication requests would only be routed to a replica when the key of the operation fell within the replica's token range in the gossip tables. Each node would locally store its own current active token range as well as a target token range it is "moving" towards. As a new node undergoes bootstrap, the bounds would be gradually expanded, allowing it to handle requests for a wider range of the keyspace as it moves towards its target token range.

This idea boils down to a move from hard cutovers to smoother operations by gradually adjusting active token ranges over a period of time. It would apply to token change operations (nodetool 'move' and 'removetoken') as well. Failure during streaming could be recovered at the bounds instead of restarting the whole process, since the active bounds would effectively track the progress of bootstrap and target token changes. Implicitly, these operations would be throttled to some degree. Node repair (AES) could also be modified along the same lines to provide a more gradual impact on the cluster, similar to the ideas given in CASSANDRA-3721.

While this doesn't spread the load of these operations over the cluster as evenly as vnodes would, that could likely be worked around by performing concurrent (throttled) bootstrap and node repair (AES) operations. It does allow some kind of "active" load balancing, though clearly not as flexible or as useful as vnodes; then again, you should be using RandomPartitioner or sort-of-randomized keys with OPP anyway, right? ;)

As a side note: vnodes fail to provide solutions to node-based limitations that seem to me to cause a substantial portion of operational issues, such as the impact of node restarts/upgrades and GC- and compaction-induced latency.
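Going back to the bounds idea for a moment, here is a rough sketch of how a node might track its active and target ranges. All names are hypothetical, none of this is existing Cassandra code, and the expansion step ignores ring wrap-around for brevity:

```java
import java.math.BigInteger;

// Illustrative sketch of the active/target token-range idea described
// above; the names and structure are invented for this example.
public class BoundedNodeState {
    // [lower, upper] bounds on the ring; a range may wrap around zero.
    static class TokenRange {
        final BigInteger lower, upper;
        TokenRange(BigInteger lower, BigInteger upper) {
            this.lower = lower;
            this.upper = upper;
        }
        // Does this range cover the token? Handles wrapped ranges.
        boolean contains(BigInteger token) {
            if (lower.compareTo(upper) <= 0)
                return token.compareTo(lower) >= 0 && token.compareTo(upper) <= 0;
            // Wrapped range, e.g. (max - 100 .. 100): covered on either side.
            return token.compareTo(lower) >= 0 || token.compareTo(upper) <= 0;
        }
    }

    // Gossiped: the range this node currently serves.
    volatile TokenRange active;
    // Local: the range this node is gradually moving towards.
    volatile TokenRange target;

    // Coordinators consult the gossiped active range, so a bootstrapping
    // node only receives reads/writes for keys it has already taken
    // responsibility for.
    boolean shouldRoute(BigInteger keyToken) {
        return active.contains(keyToken);
    }

    // One throttled step of bootstrap/move: widen the active lower bound
    // towards the target by `step` tokens, clamped at the target bound
    // (ignoring wrap-around here for simplicity). If streaming fails, we
    // resume from active.lower rather than restarting the whole transfer,
    // because the active bounds record how far we have gotten.
    void expandTowardsTarget(BigInteger step) {
        BigInteger newLower = active.lower.subtract(step).max(target.lower);
        active = new TokenRange(newLower, active.upper);
    }
}
```

The key property is that the gossiped active range, rather than a completed-bootstrap flag, is what gates routing, so progress is incremental and resumable by construction.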
On those node-based limitations, I think some progress could be made by allowing a "pack" of independent Cassandra nodes to be run on a single host; somewhat (but nowhere near entirely) similar to the pre-fork model used by some UNIX-based servers.

Input?

--
Rick Branson
DataStax