Re: Configuration for new(expanding) cluster and new admins.

2022-06-15 Thread Elliott Sims
If you set a different num_tokens value for new hosts (the value should
never be changed on an existing host), the amount of data moved to that
host will be proportional to the num_tokens value.  So, if the new hosts
are set to 32 when they're added to the cluster, those hosts will get twice
as much data as the initial 16-token hosts.
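
To make the proportionality concrete, here is a rough sketch of the expected ownership per host. It assumes each host's share of the ring is simply its token count divided by the total token count; real token allocation is random, so actual shares will deviate. The host names and the 12 + 2 layout are made up for illustration.

```python
from fractions import Fraction


def expected_ownership(tokens_per_node):
    """Expected ring share per node, assuming share is proportional to token count."""
    total = sum(tokens_per_node.values())
    return {name: Fraction(count, total) for name, count in tokens_per_node.items()}


# Hypothetical DC: 12 original hosts with 16 tokens, 2 new hosts with 32 tokens.
cluster = {f"old-{i}": 16 for i in range(12)}
cluster.update({f"new-{i}": 32 for i in range(2)})

shares = expected_ownership(cluster)
print(float(shares["old-0"]))             # 0.0625
print(float(shares["new-0"]))             # 0.125
print(shares["new-0"] / shares["old-0"])  # 2
```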

I think it's generally advised to keep a Cassandra cluster identical in
terms of hardware and num_tokens, at least within a DC.  I suspect having a
lot of different values would slow down Reaper significantly, but I've had
decent results so far adding a few hosts with beefier hardware and
num_tokens=32 to an existing 16-token cluster.

On Wed, Jun 15, 2022 at 1:33 AM Marc Hoppins  wrote:

> Hi all,
>
> Say we have 2 datacentres with 12 nodes in each. All hardware is the same.
>
> 4-core, 2 x HDD (eg, 4TiB)
>
> num_tokens = 16 as a start point
>
> If a plan is to gradually increase the nodes per DC, and new hardware will
> have more of everything, especially storage, I assume I increase the
> num_tokens value.  Should I have started with a lower value?
>
> What would be considered as a good adjustment for:
>
> Any increase in number of HDD for any node?
>
> Any increase in capacity per HDD for any node?
>
> Is there any direct correlation between new token count and the
> proportional increase in either quantity of devices or total capacity, or
> is any adjustment purely arbitrary just to differentiate between varied
> nodes?
>
> Thanks
>
> M
>



Re: Configuration for new(expanding) cluster and new admins.

2022-06-15 Thread Jeff Jirsa
You shouldn't need to change num_tokens at all.  num_tokens helps you
pretend your cluster is bigger than it is and randomly selects tokens for
you so that your data is approximately evenly distributed. As you add more
hosts, it should balance out automatically.

The alternative to num_tokens is to use a single token per node, explicitly
calculate it each time to ensure the cluster is properly balanced, and then
use `nodetool move` each time you add hosts to the cluster to
re-distribute load. num_tokens makes it less likely that you end up
imbalanced, so you shouldn't need to move any tokens manually.
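
For the single-token approach, the calculation can be sketched roughly as below. This assumes the Murmur3Partitioner token space (signed 64-bit, from -2**63 to 2**63 - 1), which the thread does not state; `balanced_tokens` is a made-up helper name, not part of any Cassandra tooling.

```python
def balanced_tokens(node_count):
    """Evenly spaced tokens over the assumed range [-2**63, 2**63 - 1]."""
    step = 2**64 // node_count
    return [i * step - 2**63 for i in range(node_count)]


# Compare the balanced layout for 3 nodes with the one for 4 nodes.
for n in (3, 4):
    print(n, balanced_tokens(n))
```

In this sketch only the first token is the same in both layouts, which is why growing a single-token cluster means moving the existing tokens as well.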



On Wed, Jun 15, 2022 at 12:34 AM Marc Hoppins  wrote:

> Hi all,
>
> Say we have 2 datacentres with 12 nodes in each. All hardware is the same.
>
> 4-core, 2 x HDD (eg, 4TiB)
>
> num_tokens = 16 as a start point
>
> If a plan is to gradually increase the nodes per DC, and new hardware will
> have more of everything, especially storage, I assume I increase the
> num_tokens value.  Should I have started with a lower value?
>
> What would be considered as a good adjustment for:
>
> Any increase in number of HDD for any node?
>
> Any increase in capacity per HDD for any node?
>
> Is there any direct correlation between new token count and the
> proportional increase in either quantity of devices or total capacity, or
> is any adjustment purely arbitrary just to differentiate between varied
> nodes?
>
> Thanks
>
> M
>


Re: more nodes than vnodes

2022-06-15 Thread Luca Rondanini
Awesome, thank you so much! I completely missed the part "the token range
that it hits will be split", now everything makes sense!

Again, thanks a lot for your help!

Luca


On Wed, Jun 15, 2022 at 1:04 AM Hannu Kröger  wrote:

> Adding a token (which in essence is a vnode) means that the token range
> that it hits will be split into two. And that data range which has a new
> owner will be replicated to the new owner node. If there are a lot of
> tokens (=vnodes) in the cluster, adding some amount of vnodes (e.g.
> num_tokens=16) is going to affect that amount (e.g. 16) of existing ranges
> but if there are a lot of tokens, each range is relatively small and
> distributed across the cluster.
>
>
> A very naive example:
> Cluster has 100 nodes and 100GB data with replication factor=3 => 300GB
> data altogether. Each node will have ~3GB data. num_tokens is let’s say
> 256. In the cluster there would be 256*100 => 25600 tokens altogether.
> You add one more node and let’s imagine that tokens are perfectly
> distributed, in the future each node will contain 2.97GB of data.
>
> When that new node is joining, those 256 tokens are (hopefully)
> distributed evenly and each of those 100 nodes will replicate ~0.03GB of
> data to that new node so that it will eventually have that 2.97GB of data.
> And the cluster would have 25856 tokens after the scaling out operation.
> And only 256 existing token ranges would be changed, not all 25600 when a
> new node is joining.
>
> So you see that for each node it’s only 30 MB to replicate to the new node.
> Not very expensive, right?
>
> In real life, it’s not so precise and all but the basic idea is the same.
>
> Cheers,
> Hannu
>
> On 15. Jun 2022, at 10.32, Luca Rondanini 
> wrote:
>
> Thanks a lot Hannu,
>
> really helpful! But isn't that crazy expensive? adding a vnode means that
> every vnode in the cluster will have a different range of tokens which
> means a lot of data will need to be moved around.
>
> Thanks again,
> Luca
>
>
>
> On Wed, Jun 15, 2022 at 12:25 AM Hannu Kröger  wrote:
>
>> When a node joins a cluster, it gets (semi-)random tokens based on
>> num_tokens value.
>>
>> Total amount of vnodes is not fixed. I don’t remember off the top of my head if
>> num_tokens can be different on each node but whenever you add a node, new
>> vnodes get “created”. Existing token ranges will be split and some range
>> will be allocated for the new node and data is being replicated to the
>> joining node. So if you have num_tokens set to a higher value like 16 or
>> so, adding and removing a single node in a cluster is standard operation
>> and although it causes some load on the cluster, it should be somewhat
>> evenly distributed among other nodes. If you have just a single token per
>> node then scaling up or down has a bit different effects due to balancing
>> issues etc. So there is a reason why default num_tokens is 16 currently.
>>
>> Cheers,
>> Hannu
>>
>> On 15. Jun 2022, at 10.12, Luca Rondanini 
>> wrote:
>>
>> ok, that makes sense, but does the partitioner add vnodes? is the number
>> of vnodes fixed in a cluster?
>>
>> On Wed, Jun 15, 2022 at 12:10 AM Hannu Kröger  wrote:
>>
>>> Hey,
>>>
>>> num_tokens is tokens per node.
>>>
>>> So in your case you would have 15 vnodes altogether.
>>>
>>> Cheers,
>>> Hannu
>>>
>>> > On 15. Jun 2022, at 10.08, Luca Rondanini 
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I'm just trying to understand better how cassandra works.
>>> >
>>> > My understanding is that, once set, the number of vnodes does not
>>> change in a cluster. The partitioner allocates vnodes to nodes ensuring
>>> replication data are not stored on the same node.
>>> >
>>> > But what happens if there are more nodes than vnodes? If I set
>>> num_tokens to 3 and I have 5 servers? Unless the partitioner adds vnodes
>>> and moves data around but it seems an extremely expensive operation. I'm
>>> sure I'm missing something, I'm not quite sure what! :)
>>> >
>>> > Thanks,
>>> > Luca
>>> >
>>>
>>>
>>
>


Re: more nodes than vnodes

2022-06-15 Thread Hannu Kröger
Adding a token (which in essence is a vnode) means that the token range that it
hits will be split into two, and the part of the range that gets a new owner
will be replicated to the new owner node. If there are a lot of tokens
(=vnodes) in the cluster, adding some number of vnodes (e.g. num_tokens=16) is
going to affect that many (e.g. 16) existing ranges, but because there are a
lot of tokens, each range is relatively small and the affected ranges are
spread across the cluster.


A very naive example:
The cluster has 100 nodes and 100GB of data with replication factor=3 => 300GB
of data altogether. Each node will have ~3GB of data. Let’s say num_tokens is
256. In the cluster there would be 256*100 => 25600 tokens altogether.
You add one more node. If we imagine that tokens are perfectly distributed,
each node will then contain 2.97GB of data.

When that new node is joining, those 256 tokens are (hopefully) distributed 
evenly and each of those 100 nodes will replicate ~0.03GB of data to that new 
node so that it will eventually have that 2.97GB of data. And the cluster would 
have 25856 tokens after the scaling out operation. And only 256 existing token 
ranges would be changed, not all 25600 when a new node is joining.

So you see that each node only has to replicate about 30 MB to the new node.
Not very expensive, right?

In real life, it’s not so precise and all but the basic idea is the same.
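
The arithmetic of the naive example can be checked with a few lines of Python. This is just the same idealised calculation (perfectly even distribution, numbers taken from the example), not a model of real streaming behaviour.

```python
nodes = 100
data_gb = 100
replication_factor = 3
num_tokens = 256

total_gb = data_gb * replication_factor        # 300 GB including replicas
per_node_before = total_gb / nodes             # 3.0 GB on each of 100 nodes
per_node_after = total_gb / (nodes + 1)        # about 2.97 GB on each of 101 nodes
streamed_per_node_mb = (per_node_before - per_node_after) * 1000

tokens_before = nodes * num_tokens             # 25600
tokens_after = tokens_before + num_tokens      # 25856

print(round(per_node_after, 2))     # 2.97
print(round(streamed_per_node_mb))  # 30
print(tokens_before, tokens_after)  # 25600 25856
```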

Cheers,
Hannu

> On 15. Jun 2022, at 10.32, Luca Rondanini  wrote:
> 
> Thanks a lot Hannu,
> 
> really helpful! But isn't that crazy expensive? adding a vnode means that 
> every vnode in the cluster will have a different range of tokens which means 
> a lot of data will need to be moved around. 
> 
> Thanks again, 
> Luca
> 
> 
> 
> On Wed, Jun 15, 2022 at 12:25 AM Hannu Kröger wrote:
> When a node joins a cluster, it gets (semi-)random tokens based on num_tokens 
> value.
> 
> Total amount of vnodes is not fixed. I don’t remember off the top of my head if
> num_tokens can be different on each node but whenever you add a node, new 
> vnodes get “created”. Existing token ranges will be split and some range will 
> be allocated for the new node and data is being replicated to the joining 
> node. So if you have num_tokens set to a higher value like 16 or so, adding 
> and removing a single node in a cluster is standard operation and although it 
> causes some load on the cluster, it should be somewhat evenly distributed 
> among other nodes. If you have just a single token per node then scaling up 
> or down has a bit different effects due to balancing issues etc. So there is 
> a reason why default num_tokens is 16 currently.
> 
> Cheers,
> Hannu
> 
>> On 15. Jun 2022, at 10.12, Luca Rondanini wrote:
>> 
>> ok, that makes sense, but does the partitioner add vnodes? is the number of 
>> vnodes fixed in a cluster?
>> 
>> On Wed, Jun 15, 2022 at 12:10 AM Hannu Kröger wrote:
>> Hey,
>> 
>> num_tokens is tokens per node.
>> 
>> So in your case you would have 15 vnodes altogether.
>> 
>> Cheers,
>> Hannu
>> 
>> > On 15. Jun 2022, at 10.08, Luca Rondanini wrote:
>> > 
>> > Hi all,
>> > 
>> > I'm just trying to understand better how cassandra works. 
>> > 
>> > My understanding is that, once set, the number of vnodes does not change 
>> > in a cluster. The partitioner allocates vnodes to nodes ensuring 
>> > replication data are not stored on the same node.
>> > 
>> > But what happens if there are more nodes than vnodes? If I set num_tokens 
>> > to 3 and I have 5 servers? Unless the partitioner adds vnodes and moves 
>> > data around but it seems an extremely expensive operation. I'm sure I'm 
>> > missing something, I'm not quite sure what! :)
>> > 
>> > Thanks,
>> > Luca
>> > 
>> 
> 



Configuration for new(expanding) cluster and new admins.

2022-06-15 Thread Marc Hoppins
Hi all,

Say we have 2 datacentres with 12 nodes in each. All hardware is the same.

4-core, 2 x HDD (eg, 4TiB)

num_tokens = 16 as a start point

If a plan is to gradually increase the nodes per DC, and new hardware will have 
more of everything, especially storage, I assume I increase the num_tokens 
value.  Should I have started with a lower value?

What would be considered as a good adjustment for:

Any increase in number of HDD for any node?

Any increase in capacity per HDD for any node?

Is there any direct correlation between new token count and the proportional 
increase in either quantity of devices or total capacity, or is any adjustment 
purely arbitrary just to differentiate between varied nodes?

Thanks

M


Re: more nodes than vnodes

2022-06-15 Thread Luca Rondanini
Thanks a lot Hannu,

really helpful! But isn't that crazy expensive? adding a vnode means that
every vnode in the cluster will have a different range of tokens which
means a lot of data will need to be moved around.

Thanks again,
Luca



On Wed, Jun 15, 2022 at 12:25 AM Hannu Kröger  wrote:

> When a node joins a cluster, it gets (semi-)random tokens based on
> num_tokens value.
>
> Total amount of vnodes is not fixed. I don’t remember off the top of my head if
> num_tokens can be different on each node but whenever you add a node, new
> vnodes get “created”. Existing token ranges will be split and some range
> will be allocated for the new node and data is being replicated to the
> joining node. So if you have num_tokens set to a higher value like 16 or
> so, adding and removing a single node in a cluster is standard operation
> and although it causes some load on the cluster, it should be somewhat
> evenly distributed among other nodes. If you have just a single token per
> node then scaling up or down has a bit different effects due to balancing
> issues etc. So there is a reason why default num_tokens is 16 currently.
>
> Cheers,
> Hannu
>
> On 15. Jun 2022, at 10.12, Luca Rondanini 
> wrote:
>
> ok, that makes sense, but does the partitioner add vnodes? is the number
> of vnodes fixed in a cluster?
>
> On Wed, Jun 15, 2022 at 12:10 AM Hannu Kröger  wrote:
>
>> Hey,
>>
>> num_tokens is tokens per node.
>>
>> So in your case you would have 15 vnodes altogether.
>>
>> Cheers,
>> Hannu
>>
>> > On 15. Jun 2022, at 10.08, Luca Rondanini 
>> wrote:
>> >
>> > Hi all,
>> >
>> > I'm just trying to understand better how cassandra works.
>> >
>> > My understanding is that, once set, the number of vnodes does not
>> change in a cluster. The partitioner allocates vnodes to nodes ensuring
>> replication data are not stored on the same node.
>> >
>> > But what happens if there are more nodes than vnodes? If I set
>> num_tokens to 3 and I have 5 servers? Unless the partitioner adds vnodes
>> and moves data around but it seems an extremely expensive operation. I'm
>> sure I'm missing something, I'm not quite sure what! :)
>> >
>> > Thanks,
>> > Luca
>> >
>>
>>
>


Re: more nodes than vnodes

2022-06-15 Thread Hannu Kröger
When a node joins a cluster, it gets (semi-)random tokens based on the
num_tokens value.

The total number of vnodes is not fixed. I don’t remember off the top of my
head if num_tokens can be different on each node, but whenever you add a node,
new vnodes get “created”. Existing token ranges will be split, some ranges will
be allocated to the new node, and their data is replicated to the joining node.
So if you have num_tokens set to a higher value like 16 or so, adding and
removing a single node in a cluster is a standard operation, and although it
causes some load on the cluster, that load should be somewhat evenly
distributed among the other nodes. If you have just a single token per node,
then scaling up or down has somewhat different effects due to balancing issues
etc. So there is a reason why the default num_tokens is currently 16.
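
A toy model of the splitting, using the numbers from the original question (5 servers, num_tokens=3). This is only a sketch: it picks tokens purely at random over an assumed signed 64-bit token space, ignores replication and racks, and `join` is a made-up helper, not how Cassandra is implemented.

```python
import bisect
import random

RING_MIN, RING_MAX = -2**63, 2**63 - 1  # assumed token space


def join(ring, owner, num_tokens, rng):
    """Give `owner` num_tokens random tokens.

    Returns the set of pre-existing tokens whose ranges were split.
    """
    existing = sorted(ring)  # tokens present before this node joined
    split = set()
    for _ in range(num_tokens):
        token = rng.randint(RING_MIN, RING_MAX)
        if existing:
            # A range ends at the next existing token clockwise (wrapping around).
            idx = bisect.bisect_left(existing, token) % len(existing)
            split.add(existing[idx])
        ring[token] = owner
    return split


rng = random.Random(42)
ring = {}
for n in range(1, 6):
    join(ring, f"node-{n}", 3, rng)
print(len(ring))   # 15 vnodes for 5 nodes with num_tokens=3

split = join(ring, "node-6", 3, rng)
print(len(ring))   # 18 vnodes after the sixth node joins
print(len(split))  # number of existing ranges that were split, at most 3
```

The other existing ranges are left untouched, which is the point: a joining node only disturbs the ranges its own tokens land in.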

Cheers,
Hannu

> On 15. Jun 2022, at 10.12, Luca Rondanini  wrote:
> 
> ok, that makes sense, but does the partitioner add vnodes? is the number of 
> vnodes fixed in a cluster?
> 
> On Wed, Jun 15, 2022 at 12:10 AM Hannu Kröger wrote:
> Hey,
> 
> num_tokens is tokens per node.
> 
> So in your case you would have 15 vnodes altogether.
> 
> Cheers,
> Hannu
> 
> > On 15. Jun 2022, at 10.08, Luca Rondanini wrote:
> > 
> > Hi all,
> > 
> > I'm just trying to understand better how cassandra works. 
> > 
> > My understanding is that, once set, the number of vnodes does not change in 
> > a cluster. The partitioner allocates vnodes to nodes ensuring replication 
> > data are not stored on the same node.
> > 
> > But what happens if there are more nodes than vnodes? If I set num_tokens 
> > to 3 and I have 5 servers? Unless the partitioner adds vnodes and moves 
> > data around but it seems an extremely expensive operation. I'm sure I'm 
> > missing something, I'm not quite sure what! :)
> > 
> > Thanks,
> > Luca
> > 
> 



Re: more nodes than vnodes

2022-06-15 Thread Luca Rondanini
ok, that makes sense, but does the partitioner add vnodes? is the number of
vnodes fixed in a cluster?

On Wed, Jun 15, 2022 at 12:10 AM Hannu Kröger  wrote:

> Hey,
>
> num_tokens is tokens per node.
>
> So in your case you would have 15 vnodes altogether.
>
> Cheers,
> Hannu
>
> > On 15. Jun 2022, at 10.08, Luca Rondanini 
> wrote:
> >
> > Hi all,
> >
> > I'm just trying to understand better how cassandra works.
> >
> > My understanding is that, once set, the number of vnodes does not change
> in a cluster. The partitioner allocates vnodes to nodes ensuring
> replication data are not stored on the same node.
> >
> > But what happens if there are more nodes than vnodes? If I set
> num_tokens to 3 and I have 5 servers? Unless the partitioner adds vnodes
> and moves data around but it seems an extremely expensive operation. I'm
> sure I'm missing something, I'm not quite sure what! :)
> >
> > Thanks,
> > Luca
> >
>
>


Re: more nodes than vnodes

2022-06-15 Thread Hannu Kröger
Hey,

num_tokens is tokens per node.

So in your case you would have 15 vnodes altogether.

Cheers,
Hannu

> On 15. Jun 2022, at 10.08, Luca Rondanini  wrote:
> 
> Hi all,
> 
> I'm just trying to understand better how cassandra works. 
> 
> My understanding is that, once set, the number of vnodes does not change in a 
> cluster. The partitioner allocates vnodes to nodes ensuring replication data 
> are not stored on the same node.
> 
> But what happens if there are more nodes than vnodes? If I set num_tokens to 
> 3 and I have 5 servers? Unless the partitioner adds vnodes and moves data 
> around but it seems an extremely expensive operation. I'm sure I'm missing 
> something, I'm not quite sure what! :)
> 
> Thanks,
> Luca
> 



more nodes than vnodes

2022-06-15 Thread Luca Rondanini
Hi all,

I'm just trying to better understand how Cassandra works.

My understanding is that, once set, the number of vnodes does not change in
a cluster. The partitioner allocates vnodes to nodes, ensuring that replicas
of the same data are not stored on the same node.

But what happens if there are more nodes than vnodes? What if I set num_tokens
to 3 and I have 5 servers? The partitioner could add vnodes and move
data around, but that seems an extremely expensive operation. I'm sure I'm
missing something, I'm not quite sure what! :)

Thanks,
Luca