On Tue, Mar 20, 2012 at 11:24 PM, Jeremiah Jordan <jeremiah.jor...@morningstar.com> wrote:
> So taking a step back: if we want "vnodes," why can't we just give every
> node 100 tokens instead of only one? It seems to me this would have less
> impact on the rest of the code. It would just look like you had a
> 500-node cluster instead of a 5-node cluster. Your replication strategy
> would have to know about the physical machines so that data gets
> replicated correctly, but there is already some concept of this with the
> data-center-aware and rack-aware stuff.
>
> From what I see, I think you could get most of the benefits of vnodes by
> implementing a new placement strategy that did something like this, and
> you wouldn't have to touch (and maybe break) code in other places.
>
> Am I crazy? Naive?
>
> Once you had this setup, you could start to implement the vnode-like
> stuff on top of it, like bootstrapping nodes in one token at a time and
> taking them on from the whole cluster, not just your neighbor. Etc., etc.

I second this. Are there any goals we've missed that cannot be achieved by
assigning multiple tokens to a single node? (A rough sketch of what I mean
is at the bottom of this mail.)

> -Jeremiah Jordan
>
> ________________________________________
> From: Rick Branson [rbran...@datastax.com]
> Sent: Monday, March 19, 2012 5:16 PM
> To: dev@cassandra.apache.org
> Subject: Re: RFC: Cassandra Virtual Nodes
>
> I think if we could go back and rebuild Cassandra from scratch, vnodes
> would likely be implemented from the beginning. However, I'm concerned
> that implementing them now could be a big distraction from more
> productive uses of all of our time, and that it could introduce major
> stability issues into what is becoming a business-critical piece of
> infrastructure for many people. Instead of offering just complaints and
> pedantry, though, I'd like to propose a feasible alternative:
>
> Has any consideration been given to the idea of supporting a single
> token range per node?
>
> While not theoretically as capable as vnodes, it seems to me to be more
> practical: it would have a significantly lower impact on the codebase and
> would provide a much clearer migration path. It also seems to address a
> majority of the complaints regarding operational issues with Cassandra
> clusters.
>
> Each node would have a lower and an upper token, which would form a
> range that would be actively distributed via gossip. Read and replication
> requests would only be routed to a replica when the key of the operation
> fell within the replica's token range in the gossip tables. Each node
> would locally store its own current active token range as well as a
> target token range it is "moving" towards.
>
> As a new node undergoes bootstrap, the bounds would be gradually
> expanded, allowing it to handle requests for a wider range of the
> keyspace as it moves towards its target token range. This idea boils down
> to a move from hard cutovers to smoother operations, achieved by
> gradually adjusting active token ranges over a period of time. It would
> apply to token change operations (nodetool 'move' and 'removetoken') as
> well.
>
> Failure during streaming could be recovered from at the bounds instead
> of restarting the whole process, as the active bounds would effectively
> track the progress of bootstrap and target token changes. Implicitly,
> these operations would be throttled to some degree. Node repair (AES)
> could also be modified along the same lines to provide a more gradual
> impact on the cluster overall, similar to the ideas given in
> CASSANDRA-3721.
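
To comment inline: the bookkeeping for the gradual-bounds idea above seems
fairly small. Below is a toy sketch of how I read Rick's proposal; every
name in it is invented, tokens are simplified to non-wrapping longs
(RandomPartitioner tokens are really BigIntegers), and a real version
would have to handle the ring wrapping around.

    // Toy sketch only; none of these names are Cassandra's actual API.
    public class ActiveTokenRange {
        private long activeLower, activeUpper;       // bounds currently served, gossiped
        private final long targetLower, targetUpper; // bounds we are "moving" towards
        private final long step;                     // widening applied per adjustment

        public ActiveTokenRange(long activeLower, long activeUpper,
                                long targetLower, long targetUpper, long step) {
            this.activeLower = activeLower;
            this.activeUpper = activeUpper;
            this.targetLower = targetLower;
            this.targetUpper = targetUpper;
            this.step = step;
        }

        // Routing: serve a read/replication request only when the key's
        // token falls inside the *active* bounds advertised via gossip.
        public boolean owns(long token) {
            return token >= activeLower && token <= activeUpper;
        }

        // Called as streaming makes progress: widen the active bounds
        // toward the target, then re-announce them in gossip. If streaming
        // dies, bootstrap resumes from the current bounds rather than
        // starting over, which is the failure-recovery property above.
        public void expandTowardTarget() {
            activeLower = Math.max(targetLower, activeLower - step);
            activeUpper = Math.min(targetUpper, activeUpper + step);
        }

        public boolean reachedTarget() {
            return activeLower == targetLower && activeUpper == targetUpper;
        }
    }
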
> While this doesn't spread the load for these operations over the cluster
> as evenly as vnodes do, that could likely be worked around by performing
> concurrent (throttled) bootstrap and node repair (AES) operations. It
> does allow some kind of "active" load balancing, though clearly not one
> as flexible or as useful as vnodes. But you should be using
> RandomPartitioner, or sort-of-randomized keys with OPP, right? ;)
>
> As a side note: vnodes fail to address the node-based limitations that
> seem to me to cause a substantial portion of operational issues, such as
> the impact of node restarts / upgrades and GC- and compaction-induced
> latency. I think some progress could be made here by allowing a "pack" of
> independent Cassandra nodes to be run on a single host; somewhat (but
> nowhere near entirely) similar to the pre-fork model used by some
> UNIX-based servers.
>
> Input?
>
> --
> Rick Branson
> DataStax
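
Finally, to make my earlier point concrete, here is the rough shape of the
multi-token placement strategy I have in mind. This is a toy sketch, not
real Cassandra code: the class and method names are invented and tokens
are simplified to longs. The one essential trick is the same one the
rack- and data-center-aware strategies already use: walk the ring and skip
any token whose physical host has already been chosen.

    import java.util.*;

    public class MultiTokenPlacementSketch {
        // token -> physical host owning that token (e.g. 100 tokens per host)
        private final NavigableMap<Long, String> ring = new TreeMap<>();

        public void addHost(String host, Collection<Long> tokens) {
            for (long t : tokens)
                ring.put(t, host);
        }

        // Pick rf *distinct physical hosts* by walking the ring clockwise
        // from the first token at or after the key's token, wrapping around.
        public List<String> replicasFor(long keyToken, int rf) {
            List<String> replicas = new ArrayList<>();
            Set<String> seen = new HashSet<>();
            List<Long> walk = new ArrayList<>(ring.tailMap(keyToken, true).keySet());
            walk.addAll(ring.headMap(keyToken, false).keySet());
            for (long t : walk) {
                String host = ring.get(t);
                if (seen.add(host)) {   // skip tokens on an already-chosen host
                    replicas.add(host);
                    if (replicas.size() == rf)
                        break;
                }
            }
            return replicas;
        }
    }

With, say, 100 tokens per host, a 5-host cluster looks like a 500-entry
ring for load-spreading purposes, yet replicasFor() still returns rf
distinct machines, which is the replication-correctness concern Jeremiah
raised.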