Re: Uneven token distribution with allocate_tokens_for_keyspace

2019-12-04 Thread Anthony Grasso
Hi Enrico,

Glad to hear the problem has been resolved and thank you for the feedback!

Kind regards,
Anthony

On Mon, 2 Dec 2019 at 22:03, Enrico Cavallin 
wrote:

> Hi Anthony,
> thank you for your hints; the new DC is now well balanced to within 2%.
> I had read your article, but I thought it was only needed for new
> "clusters", not also for new "DCs"; but RF is per DC, so it makes sense.
>
> You TLP guys are doing a great job for the Cassandra community.
>
> Thank you,
> Enrico
>
>
> On Fri, 29 Nov 2019 at 05:09, Anthony Grasso 
> wrote:
>
>> Hi Enrico,
>>
>> This is a classic chicken and egg problem with the
>> allocate_tokens_for_keyspace setting.
>>
>> The allocate_tokens_for_keyspace setting uses the replication factor that a
>> keyspace has in the DC to calculate the token allocation when a node is added
>> to the cluster for the first time.
>>
>> Nodes need to be added to the new DC before we can replicate the keyspace
>> over to it. Herein lies the problem. We are unable to use
>> allocate_tokens_for_keyspace unless the keyspace is replicated to the
>> new DC. In addition, as soon as you change the keyspace replication to the
>> new DC, new data will start to be written to it. To work around this issue
>> you will need to do the following.
>>
>>    1. Decommission all the nodes in *dcNew*, one at a time.
>>2. Once all the *dcNew* nodes are decommissioned, wipe the contents
>>in the *commitlog*, *data*, *saved_caches*, and *hints* directories
>>of these nodes.
>>    3. Make the first node to be added to *dcNew* a seed node. Set the
>>    seed list of the first node to its own IP address and the IP addresses
>>    of the other seed nodes in the cluster.
>>    4. Set the *initial_token* setting for the first node. You can
>>    calculate the values using the algorithm in my blog post:
>>    https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
>>    (a small sketch of the calculation follows this list). For convenience I
>>    have calculated them:
>>    *-9223372036854775808,-4611686018427387904,0,4611686018427387904*.
>>    Note, remove the *allocate_tokens_for_keyspace* setting from the
>>    *cassandra.yaml* file for this (seed) node.
>>    5. Check that no other node in the cluster is assigned any of the four
>>    tokens specified above. If another node in the cluster is assigned one
>>    of the above tokens, increment the conflicting token by one until no
>>    other node in the cluster is assigned that token value. The idea is to
>>    make sure that these four tokens are unique to the node.
>>    6. Add the seed node to the cluster. Make sure it is listed in *dcNew*
>>    by checking *nodetool status*.
>>7. Create a dummy keyspace in *dcNew* that has a replication factor
>>of 2.
>>    8. Set the *allocate_tokens_for_keyspace* value to be the name of the
>>    dummy keyspace for the other two nodes you want to add to *dcNew*.
>>    Note, remove the *initial_token* setting for these other nodes.
>>9. Set *auto_bootstrap* to *false* for the other two nodes you want
>>to add to *dcNew*.
>>10. Add the other two nodes to the cluster, one at a time.
>>11. If you are happy with the distribution, copy the data to *dcNew*
>>by running a rebuild.
>>
>>
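For reference, a minimal sketch of the even spacing behind the values in step 4, assuming the default Murmur3Partitioner token range of [-2**63, 2**63) and num_tokens=4 on the seed node; the conflict handling mirrors step 5, and existing_tokens is a placeholder you would fill from nodetool ring:

    num_tokens = 4
    RING_MIN = -2**63    # Murmur3Partitioner minimum token
    RING_SIZE = 2**64    # total size of the token ring

    # Tokens already assigned in the cluster (collect them with `nodetool ring`).
    existing_tokens = set()

    tokens = []
    for i in range(num_tokens):
        token = RING_MIN + i * (RING_SIZE // num_tokens)
        while token in existing_tokens:  # bump by one until the token is unique
            token += 1
        tokens.append(token)

    # Paste into cassandra.yaml as: initial_token: <comma-separated values>
    print(",".join(str(t) for t in tokens))
    # -> -9223372036854775808,-4611686018427387904,0,4611686018427387904

With an empty existing_tokens set this prints the same four values listed in step 4.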
>> Hope this helps.
>>
>> Regards,
>> Anthony
>>
>> On Fri, 29 Nov 2019 at 02:08, Enrico Cavallin 
>> wrote:
>>
>>> Hi all,
>>> I have an old datacenter with 4 nodes and 256 tokens each.
>>> I am now starting a new datacenter with 3 nodes and num_tokens=4
>>> and allocate_tokens_for_keyspace=myBiggestKeyspace on each node.
>>> Both DCs run Cassandra 3.11.x.
>>>
>>> myBiggestKeyspace has RF=3 in dcOld and RF=2 in dcNew. Now dcNew is very
>>> unbalanced.
>>> Also keyspaces with RF=2 in both DCs have the same problem.
>>> Did I miss something, or do I still have strong limitations with a low
>>> num_tokens even with allocate_tokens_for_keyspace?
>>> Any suggestions on how to mitigate it?
>>>
>>> # nodetool status myBiggestKeyspace
>>> Datacenter: dcOld
>>> ===
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address   Load   Tokens   Owns (effective)  Host ID
>>>   Rack
>>> UN  x.x.x.x  515.83 GiB  256  76.2%
>>> fc462eb2-752f-4d26-aae3-84cb9c977b8a  rack1
>>> UN  x.x.x.x  504.09 GiB  256  72.7%
>>> d7af8685-ba95-4854-a220-bc52dc242e9c  rack1
>>> UN  x.x.x.x  507.50 GiB  256  74.6%
>>> b3a4d3d1-e87d-468b-a7d9-3c104e219536  rack1
>>> UN  x.x.x.x  490.81 GiB  256  76.5%
>>> 41e80c5b-e4e3-46f6-a16f-c784c0132dbc  rack1
>>>
>>> Datacenter: dcNew
>>> ==
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  AddressLoad   Tokens   Owns (effective)  Host ID
>>>Rack
>>> UN  x.x.x.x   145.47 KiB  4   56.3%
>>> 7d089351-077f-4c36-a2f5-007682f9c215  rack1
>>> UN  x.x.x.x   122.51 KiB  4   55.5%
>>> 625dafcb-0822-4c8b-8551-5350c528907a  rack1
>>> UN  

Re: "Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB"

2019-12-04 Thread Reid Pinchback
Probably helps to think of how swap actually functions.  It has a valid place, 
so long as the behavior of the kernel and the OOM killer are understood.

You can have a lot of cold pages that have nothing at all to do with C*.  If 
you look at where memory goes, it isn’t surprising to see things that the 
kernel finds it can page out, leaving RAM for better things.  I’ve seen crond 
soak up a lot of memory, and Dell’s assorted memory-bloated tooling, for 
example. For anything that is truly cold, swap is your friend: because those things 
are infrequently used, swapping them in and out leaves more memory on average 
for what you want.  However, that's not huge numbers; it could be something 
like half a gig of RAM kept routinely free, depending on the assorted tooling 
you have as a baseline install for servers.

If swap exists to avoid the OOM killer on truly active processes, the returns 
there diminish rapidly. Within seconds you’ll find you can’t even ssh into a 
box to investigate. In something like a traditional database it’s worth the 
pain because there are multiple child processes to the rdbms, and the OOM 
killer preferentially targets big process families.  Databases can go into a 
panic if you toast a child, and you have a full-blown recovery on your hands.  
Fortunately the more mature databases give you knobs for memory tuning, like 
being able to pin particular tables in memory if they are critical; anything 
not pinned (via madvise I believe) can get tossed when under pressure.

The situation is a bit different with C*.  By design, you have replicas that 
the clients automatically find, and things like speculative retry cause 
processing to skip over the slowpokes. The better-slow-than-dead argument seems 
more tenuous to me here than for an rdbms.  And if you have an SLA based on 
latency, you’ll never meet it if you have page faults happening during memory 
references in the JVM. So if you have swappiness enabled, probably best to keep 
it tuned low.  That way a busy C* JVM hopefully is one of the last victims in 
the race to shove pages to swap.
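If it helps, here is a small sketch for checking where a Linux node currently sits on that front, assuming /proc is available (the paths are standard Linux, not Cassandra-specific):

    # Report vm.swappiness and current swap usage on a Linux host.
    with open("/proc/sys/vm/swappiness") as f:
        swappiness = int(f.read().strip())

    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.split()[0])  # values are reported in kB

    swap_used_kb = meminfo["SwapTotal"] - meminfo["SwapFree"]
    print(f"vm.swappiness = {swappiness}")
    print(f"swap in use   = {swap_used_kb / 1024:.1f} MiB of "
          f"{meminfo['SwapTotal'] / 1024:.1f} MiB")

Lowering swappiness itself is a sysctl (vm.swappiness) change; the value that makes sense depends on the host, so treat any number as a starting point rather than a recommendation.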



From: Shishir Kumar 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, December 4, 2019 at 8:04 AM
To: "user@cassandra.apache.org" 
Subject: Re: "Maximum memory usage reached (512.000MiB), cannot allocate chunk 
of 1.000MiB"

Correct. Normally one should avoid this, as performance might degrade, but the 
system will not die (until the process gets paged out).

In production we haven't done this (we just changed disk_access_mode to 
mmap_index_only). We have an environment used by customers for training/beta 
testing that grows rapidly. Investing in infra does not make sense from a cost 
perspective, so swap is an option.

But here, if the environment is up and running, it would be interesting to 
understand what is consuming memory and whether the infra is sized correctly.

-Shishir
On Wed, 4 Dec 2019, 16:13 Hossein Ghiyasi Mehr, 
mailto:ghiyasim...@gmail.com>> wrote:
"3. Though Datastax do not recommended and recommends Horizontal scale, so 
based on your requirement alternate old fashion option is to add swap space."
Hi Shishir,
swap isn't recommended by DataStax!

---
VafaTech.com - A Total Solution for Data Gathering & Analysis
---


On Tue, Dec 3, 2019 at 5:53 PM Shishir Kumar 
mailto:shishirroy2...@gmail.com>> wrote:
Options, assuming the data model and configuration are good and the data size per 
node is less than 1 TB (though there is no such benchmark):

1. Scale the infrastructure for more memory.
2. Try changing disk_access_mode to mmap_index_only.
In this case you should not have any in-memory DB tables.
3. Though DataStax does not recommend it and recommends horizontal scaling, based 
on your requirements an alternative old-fashioned option is to add swap space.
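For option 2, disk_access_mode normally lives in cassandra.yaml (it is often absent, in which case it defaults to auto); a small sketch to check what a node is currently using, assuming a package-install path of /etc/cassandra/cassandra.yaml:

    import yaml  # pip install pyyaml

    with open("/etc/cassandra/cassandra.yaml") as f:
        conf = yaml.safe_load(f)

    # Absent means Cassandra picks the default ("auto": mmap data and index files).
    print("disk_access_mode:", conf.get("disk_access_mode", "auto (default)"))

Changing it to mmap_index_only means setting that key in cassandra.yaml and restarting the node.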

-Shishir

On Tue, 3 Dec 2019, 15:52 John Belliveau, 
mailto:belliveau.j...@gmail.com>> wrote:
Reid,

I've only been working with Cassandra for 2 years, and this echoes my 
experience as well.

Regarding the cache use, I know every use case is different, but have you 
experimented and found any performance benefit to increasing its size?

Thanks,
John Belliveau

On Mon, Dec 2, 2019, 11:07 AM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
Rahul, if my memory of this is correct, that particular logging message is 
noisy, the cache is pretty much always used to its limit (and why not, it’s a 
cache, no point in using less than you have).

No matter what value you set, you’ll just change the “reached (….)” part of it. 
 I think what would help you more is to work with the team(s) that have apps 
depending upon C* and decide what your performance SLA is with them.  If you 
are meeting your SLA, you don’t care about noisy messages.  If you aren’t 
meeting your SLA, then the noisy messages become sources of ideas to look at.

One thing you’ll find out pretty quickly.  There are a lot of knobs you can 
turn with C*, too many to allow for 

Re: "Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB"

2019-12-04 Thread Shishir Kumar
Correct. Normally one should avoid this, as performance might degrade, but the
system will not die (until the process gets paged out).

In production we haven't done this (we just changed disk_access_mode to
mmap_index_only). We have an environment used by customers for training/beta
testing that grows rapidly. Investing in infra does not make sense from a cost
perspective, so swap is an option.

But here, if the environment is up and running, it would be interesting to
understand what is consuming memory and whether the infra is sized correctly.

-Shishir

On Wed, 4 Dec 2019, 16:13 Hossein Ghiyasi Mehr, 
wrote:

> "3. Though Datastax do not recommended and recommends Horizontal scale, so
> based on your requirement alternate old fashion option is to add swap
> space."
> Hi Shishir,
> swap isn't recommended by DataStax!
>
> *---*
> *VafaTech.com - A Total Solution for Data Gathering & Analysis*
> *---*
>
>
> On Tue, Dec 3, 2019 at 5:53 PM Shishir Kumar 
> wrote:
>
>> Options, assuming the data model and configuration are good and the data
>> size per node is less than 1 TB (though there is no such benchmark):
>>
>> 1. Scale the infrastructure for more memory.
>> 2. Try changing disk_access_mode to mmap_index_only.
>> In this case you should not have any in-memory DB tables.
>> 3. Though DataStax does not recommend it and recommends horizontal scaling,
>> based on your requirements an alternative old-fashioned option is to add swap space.
>>
>> -Shishir
>>
>> On Tue, 3 Dec 2019, 15:52 John Belliveau, 
>> wrote:
>>
>>> Reid,
>>>
>>> I've only been working with Cassandra for 2 years, and this echoes my
>>> experience as well.
>>>
>>> Regarding the cache use, I know every use case is different, but have
>>> you experimented and found any performance benefit to increasing its size?
>>>
>>> Thanks,
>>> John Belliveau
>>>
>>>
>>> On Mon, Dec 2, 2019, 11:07 AM Reid Pinchback 
>>> wrote:
>>>
 Rahul, if my memory of this is correct, that particular logging message
 is noisy, the cache is pretty much always used to its limit (and why not,
 it’s a cache, no point in using less than you have).



 No matter what value you set, you’ll just change the “reached (….)”
 part of it.  I think what would help you more is to work with the team(s)
 that have apps depending upon C* and decide what your performance SLA is
 with them.  If you are meeting your SLA, you don’t care about noisy
 messages.  If you aren’t meeting your SLA, then the noisy messages become
 sources of ideas to look at.



 One thing you’ll find out pretty quickly.  There are a lot of knobs you
 can turn with C*, too many to allow for easy answers on what you should
 do.  Figure out what your throughput and latency SLAs are, and you’ll know
 when to stop tuning.  Otherwise you’ll discover that it’s a rabbit hole you
 can dive into and not come out of for weeks.





 *From: *Hossein Ghiyasi Mehr 
 *Reply-To: *"user@cassandra.apache.org" 
 *Date: *Monday, December 2, 2019 at 10:35 AM
 *To: *"user@cassandra.apache.org" 
 *Subject: *Re: "Maximum memory usage reached (512.000MiB), cannot
 allocate chunk of 1.000MiB"




 It may be helpful:
 https://thelastpickle.com/blog/2018/08/08/compression_performance.html
 

 It's complex. A simple explanation: Cassandra keeps sstable chunks in memory
 based on the chunk size and sstable parts. It manages loading new sstables
 into memory correctly, based on the requests hitting the different sstables.
 You should keep an eye on it (the sstables loaded in memory).
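As a concrete lever from the compression post linked above, the chunk size is set per table; a minimal sketch using the DataStax Python driver, with ks.tbl and the contact point as hypothetical placeholders:

    # Smaller compression chunks mean less data pulled into the chunk cache per
    # read; 16 KiB is a commonly used starting point for read-heavy tables.
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])       # placeholder contact point
    session = cluster.connect()
    session.execute(
        "ALTER TABLE ks.tbl WITH compression = "
        "{'class': 'LZ4Compressor', 'chunk_length_in_kb': 16}"
    )
    cluster.shutdown()

Existing sstables only pick up the new chunk size when they are rewritten, for example by compaction or nodetool upgradesstables -a.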


 *VafaTech.com - A Total Solution for Data Gathering & Analysis*





 On Mon, Dec 2, 2019 at 6:18 PM Rahul Reddy 
 wrote:

 Thanks Hossein,



 How are the chunks moved out of memory (LRU?) when it wants to make room
 for new requests to get chunks? If it has a mechanism to clear chunks from
 the cache, what causes the "cannot allocate chunk" message? Can you point me
 to any documentation?



 On Sun, Dec 1, 2019, 12:03 PM Hossein Ghiyasi Mehr <
 ghiyasim...@gmail.com> wrote:

 Chunks are part of sstables. When there is enough space in memory to
 cache them, read performance will increase if the application requests them
 again.



 The real answer is application dependent. For example, write-heavy
 applications are different from read-heavy or read-write-heavy ones, and
 real-time applications are different from time-series data environments,
 and so on.







 On 

Re: "Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB"

2019-12-04 Thread Hossein Ghiyasi Mehr
"3. Though Datastax do not recommended and recommends Horizontal scale, so
based on your requirement alternate old fashion option is to add swap
space."
Hi Shishir,
swap isn't recommended by DataStax!

*---*
*VafaTech.com - A Total Solution for Data Gathering & Analysis*
*---*


On Tue, Dec 3, 2019 at 5:53 PM Shishir Kumar 
wrote:

> Options, assuming the data model and configuration are good and the data size
> per node is less than 1 TB (though there is no such benchmark):
>
> 1. Scale the infrastructure for more memory.
> 2. Try changing disk_access_mode to mmap_index_only.
> In this case you should not have any in-memory DB tables.
> 3. Though DataStax does not recommend it and recommends horizontal scaling,
> based on your requirements an alternative old-fashioned option is to add swap space.
>
> -Shishir
>
> On Tue, 3 Dec 2019, 15:52 John Belliveau, 
> wrote:
>
>> Reid,
>>
>> I've only been working with Cassandra for 2 years, and this echoes my
>> experience as well.
>>
>> Regarding the cache use, I know every use case is different, but have you
>> experimented and found any performance benefit to increasing its size?
>>
>> Thanks,
>> John Belliveau
>>
>>
>> On Mon, Dec 2, 2019, 11:07 AM Reid Pinchback 
>> wrote:
>>
>>> Rahul, if my memory of this is correct, that particular logging message
>>> is noisy, the cache is pretty much always used to its limit (and why not,
>>> it’s a cache, no point in using less than you have).
>>>
>>>
>>>
>>> No matter what value you set, you’ll just change the “reached (….)” part
>>> of it.  I think what would help you more is to work with the team(s) that
>>> have apps depending upon C* and decide what your performance SLA is with
>>> them.  If you are meeting your SLA, you don’t care about noisy messages.
>>> If you aren’t meeting your SLA, then the noisy messages become sources of
>>> ideas to look at.
>>>
>>>
>>>
>>> One thing you’ll find out pretty quickly.  There are a lot of knobs you
>>> can turn with C*, too many to allow for easy answers on what you should
>>> do.  Figure out what your throughput and latency SLAs are, and you’ll know
>>> when to stop tuning.  Otherwise you’ll discover that it’s a rabbit hole you
>>> can dive into and not come out of for weeks.
>>>
>>>
>>>
>>>
>>>
>>> *From: *Hossein Ghiyasi Mehr 
>>> *Reply-To: *"user@cassandra.apache.org" 
>>> *Date: *Monday, December 2, 2019 at 10:35 AM
>>> *To: *"user@cassandra.apache.org" 
>>> *Subject: *Re: "Maximum memory usage reached (512.000MiB), cannot
>>> allocate chunk of 1.000MiB"
>>>
>>>
>>>
>>>
>>> It may be helpful:
>>> https://thelastpickle.com/blog/2018/08/08/compression_performance.html
>>> 
>>>
>>> It's complex. A simple explanation: Cassandra keeps sstable chunks in memory
>>> based on the chunk size and sstable parts. It manages loading new sstables
>>> into memory correctly, based on the requests hitting the different sstables.
>>> You should keep an eye on it (the sstables loaded in memory).
>>>
>>>
>>> *VafaTech.com - A Total Solution for Data Gathering & Analysis*
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Dec 2, 2019 at 6:18 PM Rahul Reddy 
>>> wrote:
>>>
>>> Thanks Hossein,
>>>
>>>
>>>
>>> How are the chunks moved out of memory (LRU?) when it wants to make room
>>> for new requests to get chunks? If it has a mechanism to clear chunks from
>>> the cache, what causes the "cannot allocate chunk" message? Can you point me
>>> to any documentation?
>>>
>>>
>>>
>>> On Sun, Dec 1, 2019, 12:03 PM Hossein Ghiyasi Mehr <
>>> ghiyasim...@gmail.com> wrote:
>>>
>>> Chunks are part of sstables. When there is enough space in memory to
>>> cache them, read performance will increase if the application requests them
>>> again.
>>>
>>>
>>>
>>> The real answer is application dependent. For example, write-heavy
>>> applications are different from read-heavy or read-write-heavy ones, and
>>> real-time applications are different from time-series data environments,
>>> and so on.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Dec 1, 2019 at 7:09 PM Rahul Reddy 
>>> wrote:
>>>
>>> Hello,
>>>
>>>
>>>
>>> We are seeing "memory usage reached 512 MiB, cannot allocate 1 MiB" messages.
>>> I see this because file_cache_size_mb is set to 512 MB by default.
>>>
>>>
>>>
>>> The DataStax documentation recommends increasing file_cache_size_mb.
>>>
>>>
>>>
>>> We have 32 GB of memory overall and have allocated 16 GB to Cassandra. What is
>>> the recommended value in my case? Also, when does this memory get filled up?
>>> Does frequent flushing (nodetool flush) help in avoiding these info messages?
>>>
>>>
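One rough way to see how often the limit in that message is being hit, and whether it lines up with particular hours of load, is to count its occurrences in the log; a sketch assuming the default log location and line format (adjust LOG_PATH for your install):

    from collections import Counter

    LOG_PATH = "/var/log/cassandra/system.log"   # adjust for your install
    NEEDLE = "Maximum memory usage reached"

    per_hour = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            if NEEDLE in line:
                # Default log lines look like:
                # INFO  [ReadStage-1] 2019-12-04 10:15:02,123 NoSpamLogger.java:94 - ...
                parts = line.split()
                if len(parts) > 3:
                    per_hour[parts[2] + " " + parts[3][:2]] += 1  # e.g. "2019-12-04 10"

    for hour, count in sorted(per_hour.items()):
        print(f"{hour}:00  {count} occurrence(s)")

Note the message comes from NoSpamLogger, so it is already rate-limited; the counts indicate how persistently the cache sits at its limit rather than the raw number of allocation attempts.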