Ben did a talk <https://www.youtube.com/watch?v=mMZBvPXAhzU&index=39&list=PLqcm6qE9lgKJkxYZUOIykswDndrOItnn2> that might have some useful information. It's much more complicated with vnodes though and I doubt you'll be able to get it to be as rapid as you'd want.
> sets up schema to match

This shouldn't be necessary. You'd just join the node as usual but with auto_bootstrap: false and let the schema be propagated.

> Is there an issue if the vnodes tokens for two nodes are identical? Do they
> have to be distinct for each node?

Yeah. This is annoying I know. The new node will take over the tokens of the old node, which you don't want.

> Basically, I was wondering if we just use this to double the number of
> nodes with identical copies of the node data via snapshots, and then later
> on cassandra can pare down which nodes own which data.

There wouldn't be much point to adding nodes with the same (or almost the same) tokens. That would just be shifting load. You'd essentially need a very smart allocation algorithm to come up with good token ranges, but then you still have the problem of tracking down the relevant SSTables from the nodes. Basically, bootstrap does this for you ATM and only streams the relevant sections of SSTables for the new node. If you were doing it from backups/snapshots you'd need to either do the same thing (eek) or copy all the SSTables from all the relevant nodes.

With single token nodes this becomes much easier. You can likely get away with only copying around double/triple the data (depending on how you add tokens to the ring and RF and node count).

I'll just put it out there that C* is a database and really isn't designed to be rapidly scalable. If you're going to try, be prepared to invest A LOT of time into it.
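
To make the single-token case a bit more concrete, here is a rough sketch of one way to pick tokens when doubling a ring: put each new node at the midpoint of an existing range. It assumes Murmur3Partitioner (tokens from -2**63 to 2**63 - 1) and one token per node. The function name bisecting_tokens is made up for illustration and is not part of Cassandra or any tool. It only computes tokens; finding and copying the right SSTables is still on you.

```python
# Sketch only: assumes Murmur3Partitioner and single-token nodes.
RING_SIZE = 2**64      # number of tokens in the Murmur3 ring
MIN_TOKEN = -2**63     # smallest Murmur3 token


def bisecting_tokens(existing):
    """Return one new token at the midpoint of each existing token range.

    `existing` is the list of tokens currently in the ring, one per node.
    (Hypothetical helper, not a Cassandra API.)
    """
    tokens = sorted(existing)
    new_tokens = []
    for i, start in enumerate(tokens):
        end = tokens[(i + 1) % len(tokens)]
        # Width of the range (start, end], wrapping around the ring.
        # A lone node owns the whole ring, so a width of 0 means RING_SIZE.
        width = (end - start) % RING_SIZE or RING_SIZE
        mid = start + width // 2
        # Fold the midpoint back into the valid token range if it wrapped.
        mid = (mid - MIN_TOKEN) % RING_SIZE + MIN_TOKEN
        new_tokens.append(mid)
    return new_tokens


if __name__ == "__main__":
    ring = [-2**63, -2**62, 0, 2**62]   # four evenly spaced nodes
    for token in bisecting_tokens(ring):
        print(token)
```

Since a node owns the range ending at its own token, a new node placed at a midpoint takes the lower half of the range that the next node up the ring used to own, so that node (and its replicas, depending on RF) is where the data would have to come from.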