There's a lot going on here... hopefully I can respond to everything in a
coherent manner.

> Perhaps a simple way to avoid this is to update the random allocation
algorithm to re-generate tokens when the ranges created do not have a good
size distribution?

Instead of using random tokens for the first node, I think we'd be better
off picking a random initial token and then using an even distribution
around the ring, with that first token as an offset.  The main benefit of
random is that we don't get collisions, not the distribution.  I haven't
read through the change in CASSANDRA-15600; maybe it addresses this problem
already, in which case we can ignore my suggestion here.
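
Rough sketch of what I mean, in Python (this assumes the full signed 64-bit
Murmur3 range and is not what the allocator does today, just an
illustration of "random offset + even spacing"):

    import random

    MIN_TOKEN = -(2**63)   # Murmur3Partitioner minimum token
    RING_SIZE = 2**64      # total size of the token ring

    def offset_tokens(num_tokens):
        # Pick one random offset, then space the remaining tokens evenly
        # around the ring relative to that offset.
        offset = random.randrange(RING_SIZE)
        step = RING_SIZE // num_tokens
        return sorted(MIN_TOKEN + (offset + i * step) % RING_SIZE
                      for i in range(num_tokens))

    print(offset_tokens(4))

The first node then starts with perfectly even ranges instead of whatever
luck gives it.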

> Clusters where we have used num_tokens 4 we have regretted.
> While we accept the validity and importance of the increased availability
provided by num_tokens 4, we have never seen or used it in practice.

While we worked together, I personally moved quite a few clusters to 4
tokens and didn't run into any balance issues.  I'm not sure why you're
saying you've never seen it in practice; I did it with a whole bunch of our
clients.

Mick said:

> We know of a number of production clusters that have been set up this
way. I am unaware of any Cassandra docs or community recommendations that
say you should avoid doing this. So, this is a problem regardless of the
value for num_tokens.

Paulo:

> Having the number of racks not a multiple of the replication factor is
not a good practice since it can lead to imbalance and other problems like
this, so we should not only document this but perhaps add a warning or even
hard fail when this is encountered during node startup?

Agreed on both the above - I intend to document this in CASSANDRA-15618.
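
For the warning / hard fail idea, something along these lines is roughly
what I picture (a sketch only; the names are made up and this isn't an
existing Cassandra check):

    def check_rack_count(rack_count, replication_factor, hard_fail=False):
        # Warn (or refuse to start) when a DC has more than one rack but
        # the rack count is not a multiple of the replication factor.
        if rack_count > 1 and rack_count % replication_factor != 0:
            msg = ("%d racks with RF=%d: expect imbalance; use 1 rack or a "
                   "multiple of RF racks" % (rack_count, replication_factor))
            if hard_fail:
                raise SystemExit(msg)
            print("WARNING: " + msg)

    check_rack_count(rack_count=2, replication_factor=3)  # warns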

Mick, from your test:

>  Each cluster was configured with one rack.

This is an important nuance of the results you're seeing.  It sounds like
the test covers the edge case of using a single rack / AZ for an entire
cluster.  Of the several hundred clusters I looked at over the almost 4
years I was at TLP, I can't remember seeing this very often.  That isn't to
say it's not out there in the wild, but I don't think it should drive us to
pick a token count.  We can probably do better than using a completely
random algorithm for the corner case of using a single rack or fewer racks
than RF, and we should also encourage people to run Cassandra in a way that
doesn't set them up to shoot themselves in the foot.

In a world of tradeoffs, I'm still not convinced that 16 tokens makes any
sense as a default.  Assuming we can fix the worst-case random imbalance in
small clusters, 4 is a significantly better option, as it will make it
easier for teams to scale Cassandra out the way we claim they can.  Using
16 tokens puts an unnecessary (and probably unknown) ceiling on people's
ability to scale, and for the *majority* of clusters, where people pick
Cassandra for scalability and availability, it's still too high.  I'd
rather we ship a default that works best for the majority of people and
document the cases where people might want to deviate from it, rather than
pick a somewhat crappy (but better than 256) default.

That said, we don't have the better token distribution yet, so if we're
going to assume people just put C* in production with minimal configuration
changes, 16 will help us deal with the imbalance issues *today*.  We know
it works better than 256, so I'm willing to take this as a win *today*, on
the assumption that folks are OK changing this value again before we
release 4.0 if we find we can make it work without the super sharp edges we
can currently stab ourselves with.  I'd much rather ship C* with 16 tokens
than 256, and I don't want to keep debating this so much that we never end
up making any change at all.

I propose we drop it to 16 immediately.  I'll add the production docs
in CASSANDRA-15618 with notes on token count and the reasons why you'd want
1, 4, or 16.  As a follow-up, if we can get a token simulation written we
can try all sorts of topologies with whatever token algorithms we want.
Once that simulation is written and we've got some reports, we can revisit
the default.
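
To be concrete about the simulation, even a toy version along these lines
would tell us a lot (purely random tokens, primary ranges only, no
replication or racks; the numbers it prints are illustrative, not results):

    import random

    RING = 2**64

    def worst_imbalance(node_count, num_tokens, trials=100):
        # Assign purely random tokens to every node, then report the worst
        # max/min primary-range ownership ratio seen across the trials.
        worst = 0.0
        for _ in range(trials):
            owners = sorted((random.randrange(RING), node)
                            for node in range(node_count)
                            for _ in range(num_tokens))
            owned = [0] * node_count
            for i, (token, node) in enumerate(owners):
                prev = owners[i - 1][0]  # i == 0 wraps around the ring
                owned[node] += (token - prev) % RING
            worst = max(worst, max(owned) / min(owned))
        return worst

    for n in (1, 4, 16):
        print(n, round(worst_imbalance(node_count=6, num_tokens=n), 2))

The real version would need to model racks, RF, and the actual allocation
algorithm for nodes after the first, but the shape of the report is the
same.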

Eventually we'll probably need to add the ability for folks to fix cluster
imbalances without adding / removing hardware, but I suspect we've got a
fair amount of plumbing to rework to make something like that doable.

Jon


On Mon, Mar 9, 2020 at 5:03 AM Paulo Motta <pauloricard...@gmail.com> wrote:

> Great investigation, good job guys!
>
> > Personally I would have liked to have seen even more iterations. While
> 14 run iterations gives an indication, the average of randomness is not
> what is important here. What concerns me is the consequence for imbalance
> as the cluster grows when you're very unlucky with the initial random
> tokens, for example when random tokens land very close together. The token
> allocation can deal with breaking up large token ranges but is unable to do
> anything about such tiny token ranges. Even a bad 1-in-100 experience
> should be a consideration when picking a default num_tokens.
>
> Perhaps a simple way to avoid this is to update the random allocation
> algorithm to re-generate tokens when the ranges created do not have a good
> size distribution?
>
> > But it can be worse, for example if you have RF=3 and only two racks
> then you will only get random tokens. We know of a number of production
> clusters that have been set up this way. I am unaware of any Cassandra docs
> or community recommendations that say you should avoid doing this. So, this
> is a problem regardless of the value for num_tokens.
>
> Having the number of racks not a multiple of the replication factor is not
> a good practice since it can lead to imbalance and other problems like
> this, so we should not only document this but perhaps add a warning or even
> hard fail when this is encountered during node startup?
>
> Cheers,
>
> Paulo
>
> Em seg., 9 de mar. de 2020 às 08:25, Mick Semb Wever <m...@apache.org>
> escreveu:
>
>>
>> > Can we ask for some analysis and data against the risks different
>> > num_tokens choices present. We shouldn't rush into a new default, and
>> such
>> > background information and data is operator value added.
>>
>>
>> Thanks for everyone's patience on this topic.
>> The following is further input on a number of fronts.
>>
>>
>> ** Analysis of Token Distributions
>>
>> The following is work done by Alex Dejanovski and Anthony Grasso. It
>> builds upon their previous work at The Last Pickle and why we recommend 16
>> as the best value to clients. (Please buy beers for these two for the
>> effort they have done here.)
>>
>> The following three graphs show the ranges of imbalance that occur on
>> clusters growing from 4 nodes to 12 nodes, for the different values of
>> num_tokens: 4, 8 and 16. The range is based on 14 run iterations (except
>> num_tokens 16, which only got ten).
>>
>> [Graphs omitted: imbalance ranges for num_tokens 4, 8 and 16.]
>>
>> These graphs were generated using clusters created in AWS by tlp-cluster (
>> https://github.com/thelastpickle/tlp-cluster). A script was written to
>> automate the testing and generate the data for each value of num_tokens.
>> Each cluster was configured with one rack.  Of course these interpretations
>> are debatable. The data behind the graphs is in
>> https://docs.google.com/spreadsheets/d/1gPZpSOUm3_pSCo9y-ZJ8WIctpvXNr5hDdupJ7K_9PHY/edit?usp=sharing
>>
>>
>> What I see from these graphs is…
>>  a) token allocation is pretty good at fixing initial bad random token
>> imbalances. By the time you are at 12 nodes, presuming you have set up the
>> cluster correctly so that token allocation actually works, your nodes will
>> be balanced with num_tokens 4 or greater.
>>  b) you need to get to ~12 nodes with num_tokens 4 to have a good balance.
>>  c) you need to get to ~9 nodes with num_tokens 8 to have a good balance.
>>  d) you need to get to ~6 nodes with num_tokens 16 to have a good balance.
>>
>> Personally I would have liked to have seen even more iterations. While 14
>> run iterations gives an indication, the average of randomness is not what
>> is important here. What concerns me is the consequence for imbalance as the
>> cluster grows when you're very unlucky with the initial random tokens, for
>> example when random tokens land very close together. The token allocation
>> can deal with breaking up large token ranges but is unable to do anything
>> about such tiny token ranges. Even a bad 1-in-100 experience should be a
>> consideration when picking a default num_tokens.
>>
>>
>> ** When does the Token Allocation work…
>>
>> This has been touched on already in this thread. There are cases where
>> token allocation fails to kick in. The first node in each of up to RF racks
>> generates random tokens, which typically means the first three nodes.
>>
>> But it can be worse, for example if you have RF=3 and only two racks then
>> you will only get random tokens. We know of a number of production clusters
>> that have been set up this way. I am unaware of any Cassandra docs or
>> community recommendations that say you should avoid doing this. So, this is
>> a problem regardless of the value for num_tokens.
>>
>>
>> ** Algorithmic token allocation does not handle the racks = RF case well (
>> CASSANDRA-15600 <https://issues.apache.org/jira/browse/CASSANDRA-15600>)
>>
>> This recently landed in trunk. My understanding is that this improves the
>> situation the graphs cover, but not the situation just described where a DC
>> has 1 < racks < RF.  Ekaterina, maybe you could elaborate?
>>
>>
>> ** Decommissioning Nodes
>>
>> Elasticity is a feature of Cassandra. The operational costs of running
>> Cassandra are a real consideration. A reduction from a 9 node cluster back
>> to a 6 node cluster does happen often enough. Decommissioning nodes on
>> smaller clusters offers the greatest operational cost savings, yet such
>> clusters will suffer most from too low a num_tokens setting.
>>
>>
>> ** Recommendations from Cassandra Consulting Companies
>>
>> My understanding is that DataStax recommends num_tokens 8, while
>> Instaclustr and The Last Pickle have both recommended 16. Interestingly
>> enough, those pushing for num_tokens 4 are today using num_tokens 1 (and
>> are already sitting on a lot of in-house C* experience).
>>
>>
>> ** Keeping it Real
>>
>> Clusters where we have used num_tokens 4 we have regretted. This, and past
>> analysis work similar to the above, has led us to use num_tokens 16. Cost
>> optimisation of clusters is one of the key user concerns out there, and we
>> have witnessed problems on this front with num_tokens 4.
>>
>> While we accept the validity and importance of the increased availability
>> provided by num_tokens 4, we have never seen or used it in practice. The
>> default value of num_tokens is important. The value of 256 has been good
>> business for consultants, but it was a bad choice for clusters and is
>> difficult to change. A new default should be chosen wisely.
>>
>>
>> regards,
>> Mick, Anthony, Alex
>>
>>
