So taking a step back: if we want "vnodes", why can't we just give every node
100 tokens instead of only one?  It seems to me this would have less impact on
the rest of the code.  It would just look like you had a 500-node cluster
instead of a 5-node cluster.  Your replication strategy would have to know
about the physical machines so that data gets replicated correctly, but there
is already some concept of this with the data-center-aware and rack-aware
stuff.

From what I see, I think you could get most of the benefits of vnodes by
implementing a new Placement Strategy that did something like this, and you
wouldn't have to touch (and maybe break) code in other places.
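
Roughly what I'm picturing (just a sketch in plain Java, not the actual
AbstractReplicationStrategy API, and all names made up): register many tokens
per physical machine, and when picking replicas walk the ring skipping tokens
whose machine has already been chosen, so replicas still land on distinct
hosts:

import java.util.*;

// Hypothetical sketch only -- not Cassandra's real replication strategy API.
// Each physical host owns many tokens; replica selection walks the ring from
// the key's token and skips tokens whose host has already been chosen, so
// the replicas end up on distinct physical machines.
class MultiTokenPlacementSketch {
    // ring: token -> owning physical host (each host appears many times)
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addHost(String host, Collection<Long> tokens) {
        for (long t : tokens)
            ring.put(t, host);
    }

    List<String> replicasFor(long keyToken, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        Set<String> chosen = new HashSet<>();
        // walk the ring starting at the first token >= keyToken, then wrap
        List<String> walk = new ArrayList<>(ring.tailMap(keyToken).values());
        walk.addAll(ring.headMap(keyToken).values());
        for (String host : walk) {
            if (replicas.size() >= replicationFactor)
                break;
            if (chosen.add(host))      // skip hosts we've already picked
                replicas.add(host);
        }
        return replicas;
    }
}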

Am I crazy? Naive?

Once you had this set up, you could start to implement the vnode-like stuff on
top of it, like bootstrapping nodes in one token at a time and taking those
tokens on from the whole cluster, not just your neighbor, etc.

-Jeremiah Jordan

________________________________________
From: Rick Branson [rbran...@datastax.com]
Sent: Monday, March 19, 2012 5:16 PM
To: dev@cassandra.apache.org
Subject: Re: RFC: Cassandra Virtual Nodes

I think if we could go back and rebuild Cassandra from scratch, vnodes
would likely be implemented from the beginning. However, I'm concerned that
implementing them now could be a big distraction from more productive uses
of all of our time and introduce major potential stability issues into what
is becoming a business critical piece of infrastructure for many people.
However, rather than just complaining and being pedantic, I'd like to offer a
feasible alternative:

Has there been consideration given to the idea of supporting a single token
range for a node?

While not theoretically as capable as vnodes, it seems to me to be more
practical as it would have a significantly lower impact on the codebase and
provides a much clearer migration path. It also seems to solve a majority
of complaints regarding operational issues with Cassandra clusters.

Each node would have a lower and an upper token, which would form a range
that would be actively distributed via gossip. Read and replication
requests would only be routed to a replica when the key of these operations
matched the replica's token range in the gossip tables. Each node would
locally store its own current active token range as well as a target token
range it's "moving" towards.
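
To make the routing check concrete, here's a minimal sketch of the sort of
test a coordinator might apply against the gossiped range (the names are
mine, not actual gossip fields):

// Minimal illustrative sketch, assuming each node gossips a (lower, upper]
// token range; the names here are invented, not Cassandra's gossip schema.
final class GossipedTokenRange {
    final long lower;  // exclusive
    final long upper;  // inclusive

    GossipedTokenRange(long lower, long upper) {
        this.lower = lower;
        this.upper = upper;
    }

    // A coordinator would only route a read/replication request to this node
    // if the key's token falls inside the advertised range (wrap-around aware).
    boolean contains(long keyToken) {
        if (lower < upper)
            return keyToken > lower && keyToken <= upper;
        return keyToken > lower || keyToken <= upper;  // range wraps the ring
    }
}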

As a new node undergoes bootstrap, the bounds would be gradually expanded
to allow it to handle requests for a wider range of the keyspace as it
moves towards its target token range. This idea boils down to a move from
hard cutovers to smoother operations by gradually adjusting active token
ranges over a period of time. It would apply to token change operations
(nodetool 'move' and 'removetoken') as well.
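
Purely as an illustration of the gradual-expansion idea (invented names,
wrap-around ignored): the bootstrapping node only advertises the sub-range it
has actually finished streaming, widening it slice by slice until it matches
the target:

// Purely illustrative; wrap-around ignored for brevity, all names invented.
final class GradualBootstrapSketch {
    private final long targetLower;    // exclusive lower bound of the target range
    private final long targetUpper;    // inclusive upper bound of the target range
    private long streamedUpTo;         // how far streaming has progressed so far
    private boolean anySliceDone = false;

    GradualBootstrapSketch(long targetLower, long targetUpper) {
        this.targetLower = targetLower;
        this.targetUpper = targetUpper;
        this.streamedUpTo = targetLower;
    }

    // Called each time a slice of data finishes streaming: advance the bound
    // and (conceptually) re-gossip the wider range (targetLower, streamedUpTo].
    void onSliceStreamed(long upTo) {
        streamedUpTo = Math.min(upTo, targetUpper);
        anySliceDone = true;
        // gossipActiveRange(targetLower, streamedUpTo);  // hypothetical
    }

    // The node serves a key only once its active range has expanded to cover it.
    boolean servesKey(long keyToken) {
        return anySliceDone && keyToken > targetLower && keyToken <= streamedUpTo;
    }
}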

Failure during streaming could be recovered at the bounds instead of
restarting the whole process as the active bounds would effectively track
the progress for bootstrap & target token changes. Implicitly these
operations would be throttled to some degree. Node repair (AES) could also
be modified using the same overall ideas to provide a more gradual impact on
the cluster overall, similar to the ideas given in CASSANDRA-3721.
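
The recovery property falls out of the same state (again just a sketch with
invented names): the advertised bound doubles as a durable progress marker,
so a failed bootstrap or move resumes from the last recorded bound instead of
restarting the whole range:

// Illustrative only -- the active bound doubles as a resume point after failure.
final class ResumableStreamSketch {
    private final long targetUpper;    // inclusive end of the target range
    private long recordedBound;        // last bound made durable before a crash

    ResumableStreamSketch(long targetUpper, long recordedBound) {
        this.targetUpper = targetUpper;
        this.recordedBound = recordedBound;
    }

    // After a restart, streaming continues from the recorded bound instead of
    // from the beginning of the target range.
    long nextSliceStart() {
        return recordedBound;
    }

    boolean finished() {
        return recordedBound >= targetUpper;
    }

    void recordProgress(long newBound) {
        recordedBound = Math.min(newBound, targetUpper);
        // persist(recordedBound);  // hypothetical: make the marker durable first
    }
}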

While this doesn't spread the load for these operations over the cluster as
evenly as vnodes would, that could likely be worked around by performing
concurrent (throttled) bootstrap & node repair (AES) operations. It does
allow some kind of "active" load balancing, though clearly not as flexible or
as useful as vnodes; then again, you should be using RandomPartitioner or
sort-of-randomized keys with OPP anyway, right? ;)

As a side note: vnodes fail to provide solutions to node-based limitations
that seem to me to cause a substantial portion of operational issues such
as impact of node restarts / upgrades, GC and compaction induced latency. I
think some progress could be made here by allowing a "pack" of independent
Cassandra nodes to be run on a single host; somewhat (but nowhere near
entirely) similar to a pre-fork model used by some UNIX-based servers.

Input?

--
Rick Branson
DataStax
