Do vnodes address anything besides relieving cluster planners of managing token ranges on nodes manually? Do we have a centralized list of the advantages they provide beyond that?
There seem to be lots of downsides: 2i index performance, the availability issue above, etc. I also wonder whether, with vnodes (and with manually managed tokens... I'll return to this), node recovery scenarios are being hampered by sstables having the hash ranges of the vnodes intermingled in the same set of sstables. I wondered in another thread on vnodes why sstables are not separated into sets by the vnode ranges they represent.

For a manually managed contiguous token range, you could separate the sstables into a fixed number of sets, kind of vnode-light. So if there was rebalancing or reconstruction, you could sneakernet or reliably send entire sstable sets that belong in a range.

I also think this would improve compactions and repairs. Compactions would be naturally parallelizable in all compaction schemes, and repairs would have natural subsets for merkle tree calculations.

Granted, sending sstables might result in "overstreaming" due to data replication across the sstables, but you wouldn't spend CPU and random I/O looking up the data. Just sequential transfers.

For manually managed tokens with subdivided sstables, if there was rebalancing, you would have the "fringe" edges of the hash range subdivided already, so you would only need to deal with the data in the border areas of the token range, and again could sneakernet / dumb transfer the tables and then let the new node remove the unneeded data in future repairs. (Compaction does not remove data that is no longer managed by a node, only repair does? Or does only nodetool cleanup do that?)

Pre-subdivided sstables for manually managed tokens would REALLY pay big dividends in large-scale cluster expansion. Say you wanted to double or triple the cluster. Since the sstables are already split by some numeric factor that has lots of even divisors (60 for RF 2,3,4,5), you simply bulk copy the already-subdivided sstables for the new nodes' hash ranges and you'd basically be done.
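To make the doubling/tripling case concrete, here is a rough sketch of the bookkeeping I have in mind. It is only an illustration, not how Cassandra lays out sstables today: the names `SUBDIVISIONS`, `bucket_for_token` and `split_buckets` are made up, and a node's range is simplified to a half-open integer interval with no ring wraparound.

```python
from typing import List

# Hypothetical number of sstable sets per node; 60 divides evenly by 2, 3, 4, 5, 6, ...
SUBDIVISIONS = 60


def bucket_for_token(token: int, start: int, end: int,
                     buckets: int = SUBDIVISIONS) -> int:
    """Return the sstable set a token falls into for a node owning [start, end)."""
    if not start <= token < end:
        raise ValueError("token outside this node's range")
    # Integer arithmetic only, so bucket edges are exact.
    return (token - start) * buckets // (end - start)


def split_buckets(factor: int, buckets: int = SUBDIVISIONS) -> List[List[int]]:
    """Group bucket indices into `factor` contiguous shares, one per resulting node."""
    if factor < 1 or buckets % factor:
        raise ValueError("factor must evenly divide the bucket count")
    share = buckets // factor
    return [list(range(i * share, (i + 1) * share)) for i in range(factor)]
```

With this layout, `split_buckets(3)` gives three shares of 20 whole buckets each, so one share can stay on the existing node while the other two are copied as complete sstable sets, with no per-row lookups. A factor that does not divide 60 (say 7) is rejected, because a bucket would then straddle two nodes' ranges.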
With AWS EBS volumes, that bulk copy could just be a drive detach / drive attach.

On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves <k...@instaclustr.com> wrote:

> Great write up. Glad someone finally did the math for us. I don't think this will come as a surprise for many of the developers. Availability is only one issue raised by vnodes. Load distribution and performance are also pretty big concerns.
>
> I'm always a proponent for fixing vnodes, and removing them as a default until we do. Happy to help on this and we have ideas in mind that at some point I'll create tickets for...
>
> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, <joe.e.ly...@gmail.com> wrote:
>
> > If the blob link on github doesn't work for the pdf (looks like mobile might not like it), try:
> >
> > https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> >
> > -Joey
> >
> > On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch <joe.e.ly...@gmail.com> wrote:
> >
> > > Josh Snyder and I have been working on evaluating virtual nodes for large scale deployments and while it seems like there is a lot of anecdotal support for reducing the vnode count [1], we couldn't find any concrete math on the topic, so we had some fun and took a whack at quantifying how different choices of num_tokens impact a Cassandra cluster.
> > >
> > > According to the model we developed [2] it seems that at small cluster sizes there isn't much of a negative impact on availability, but when clusters scale up to hundreds of hosts, vnodes have a major impact on availability. In particular, the probability of outage during short failures (e.g. process restarts or failures) or permanent failure (e.g. disk or machine failure) appears to be orders of magnitude higher for large clusters.
> > >
> > > The model attempts to explain why we may care about this and advances a few existing/new ideas for how to fix the scalability problems that vnodes fix without the availability (and consistency, due to the effects on repair) problems high num_tokens create. We would of course be very interested in any feedback. The model source code is on github [3], PRs are welcome or feel free to play around with the jupyter notebook to match your environment and see what the graphs look like. I didn't attach the pdf here because it's too large apparently (lots of pretty graphs).
> > >
> > > I know that users can always just pick whichever number they prefer, but I think the current default was chosen when token placement was random, and I wonder whether it's still the right default.
> > >
> > > Thank you,
> > > -Joey Lynch
> > >
> > > [1] https://issues.apache.org/jira/browse/CASSANDRA-13701
> > > [2] https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf
> > >     <https://github.com/jolynch/python_performance_toolkit/blob/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf>
> > > [3] https://github.com/jolynch/python_performance_toolkit/tree/master/notebooks/cassandra_availability