Re: Curiosity in adding nodes

2019-10-21 Thread guo Maxwell
1. The node being added to the ring calculates the token ranges it will
own, then requests the data for those ranges from the nodes that
originally owned it.
2. The SSTables to be streamed, and the ranges they cover, are estimated.
3. Streaming begins. Secondary indexes are built after the SSTables have
been streamed successfully.
4. When all data has been transferred, the node's status changes from
joining to normal, and the new status is written to the system keyspace.
5. While data is streaming, the joining node can accept writes but does
not serve reads.
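Step 1 above can be sketched as follows. This is a minimal illustration
assuming one token per node and no vnodes; the node names and token values
are made up, not taken from any real cluster:

```python
import bisect

def owner(ring, key_token):
    # ring: sorted list of (token, node); a key belongs to the first node
    # whose token is >= the key's token, wrapping around the ring.
    tokens = [t for t, _ in ring]
    i = bisect.bisect_left(tokens, key_token)
    return ring[i % len(ring)][1]

ring = [(100, "A"), (200, "B"), (300, "C")]
assert owner(ring, 150) == "B"

# New node D joins with token 250: it takes over the range (200, 250],
# which previously belonged to C, so C streams that slice of data to D.
ring = sorted(ring + [(250, "D")])
assert owner(ring, 220) == "D"
assert owner(ring, 260) == "C"
```

Once D finishes receiving that slice, it flips from joining to normal and
starts serving reads for the range it took over.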

Eunsu Kim  于2019年10月22日周二 上午9:54写道:

> Hi experts,
>
> When a new node is added, how can the coordinator find data that has not
> yet been streamed?
>
> Or are new nodes not used until all data is streamed?
>
> Thanks in advance
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>

-- 
you are the apple of my eye !


Re: Curiosity in adding nodes

2019-10-21 Thread Craig Pastro
If I understand correctly this is controlled by setting `auto_bootstrap`.
If it is set to true (the default), once the node joins the cluster it will
have some portion of the data assigned to it, and its data will be streamed
to it from the other nodes. Only once the data has finished streaming will
this node start to answer queries. So to answer your question,

> Or are new nodes not used until all data is streamed?

Yes, by default.

You probably do not want to set `auto_bootstrap` to false. In fact, it is
"hidden" in `cassandra.yaml` (
https://issues.apache.org/jira/browse/CASSANDRA-2447). To see why you do
not want to set it to false there are a couple of nice articles:
https://monzo.com/blog/2019/09/08/why-monzo-wasnt-working-on-july-29th
https://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html
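For reference, the relevant cassandra.yaml line looks roughly like this —
a sketch only, since as noted above the property is hidden and defaults to
true, so it normally does not appear in the shipped file at all:

```
# auto_bootstrap defaults to true and is intentionally absent from the
# shipped cassandra.yaml (see CASSANDRA-2447). Only set it explicitly
# (to false) in special cases, e.g. restoring a node from its own snapshot.
auto_bootstrap: true
```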



On Tue, Oct 22, 2019 at 10:54 AM Eunsu Kim  wrote:

> Hi experts,
>
> When a new node is added, how can the coordinator find data that has not
> yet been streamed?
>
> Or are new nodes not used until all data is streamed?
>
> Thanks in advance
>
>


Curiosity in adding nodes

2019-10-21 Thread Eunsu Kim
Hi experts,

When a new node is added, how can the coordinator find data that has not
yet been streamed?

Or are new nodes not used until all data is streamed?

Thanks in advance



Re: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Sergio
Thanks Jon!

I used that tool and I did a test to compare LCS and STCS and it works
great. However, I was referring to the JVM flags that you use since there
are a lot of flags that I found as default and I would like to exclude the
unused or wrong ones from the current configuration.

I have also another thread opened where I am trying to figure out Kernel
Settings for TCP
https://lists.apache.org/thread.html/7708c22a1d95882598cbcc29bc34fa54c01fcb33c40bb616dcd3956d@%3Cuser.cassandra.apache.org%3E

Do you have anything to add to that?

Thanks,

Sergio

Il giorno lun 21 ott 2019 alle ore 15:09 Jon Haddad  ha
scritto:

> tlp-stress comes with workloads pre-baked, so there's not much
> configuration to do.  The main flags you'll want are going to be:
>
> -d : duration, I highly recommend running your test for a few days
> --compaction
> --compression
> -p: number of partitions
> -r: % of reads, 0-1
>
> For example, you might run:
>
> tlp-stress run KeyValue -d 24h --compaction lcs -p 10m -r .9
>
> for a basic key value table, running for 24 hours, using LCS, 10 million
> partitions, 90% reads.
>
> There's a lot of options. I won't list them all here, it's why I wrote the
> manual :)
>
> Jon
>
>
> On Mon, Oct 21, 2019 at 1:16 PM Sergio  wrote:
>
>> Thanks, guys!
>> I just copied and pasted what I found on our test machines but I can
>> confirm that we have the same settings except for 8GB in production.
>> I didn't select these settings and I need to verify why these settings
>> are there.
>> If any of you want to share your flags for a read-heavy workload it would
>> be appreciated, so I would replace and test those flags with TLP-STRESS.
>> I am thinking about different approaches (G1GC vs ParNew + CMS)
>> How many GB for RAM do you dedicate to the OS in percentage or in an
>> exact number?
>> Can you share the flags for ParNew + CMS that I can play with it and
>> perform a test?
>>
>> Best,
>> Sergio
>>
>>
>> Il giorno lun 21 ott 2019 alle ore 09:27 Reid Pinchback <
>> rpinchb...@tripadvisor.com> ha scritto:
>>
>>> Since the instance size is < 32gb, hopefully swap isn’t being used, so
>>> it should be moot.
>>>
>>>
>>>
>>> Sergio, also be aware that  -XX:+CMSClassUnloadingEnabled probably
>>> doesn’t do anything for you.  I believe that only applies to CMS, not
>>> G1GC.  I also wouldn’t take it as gospel truth that  -XX:+UseNUMA is a good
>>> thing on AWS (or anything virtualized), you’d have to run your own tests
>>> and find out.
>>>
>>>
>>>
>>> R
>>>
>>> *From: *Jon Haddad 
>>> *Reply-To: *"user@cassandra.apache.org" 
>>> *Date: *Monday, October 21, 2019 at 12:06 PM
>>> *To: *"user@cassandra.apache.org" 
>>> *Subject: *Re: [EXTERNAL] Re: GC Tuning
>>> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>>>
>>>
>>>
>>> *Message from External Sender*
>>>
>>> One thing to note, if you're going to use a big heap, cap it at 31GB,
>>> not 32.  Once you go to 32GB, you don't get to use compressed pointers [1],
>>> so you get less addressable space than at 31GB.
>>>
>>>
>>>
>>> [1]
>>> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
>>> 
>>>
>>>
>>>
>>> On Mon, Oct 21, 2019 at 11:39 AM Durity, Sean R <
>>> sean_r_dur...@homedepot.com> wrote:
>>>
>>> I don’t disagree with Jon, who has all kinds of performance tuning
>>> experience. But for ease of operation, we only use G1GC (on Java 8),
>>> because the tuning of ParNew+CMS requires a high degree of knowledge and
>>> very repeatable testing harnesses. It isn’t worth our time. As a previous
>>> writer mentioned, there is usually better return on our time tuning the
>>> schema (aka helping developers understand Cassandra’s strengths).
>>>
>>>
>>>
>>> We use 16 – 32 GB heaps, nothing smaller than that.
>>>
>>>
>>>
>>> Sean Durity
>>>
>>>
>>>
>>> *From:* Jon Haddad 
>>> *Sent:* Monday, October 21, 2019 10:43 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* [EXTERNAL] Re: GC Tuning
>>> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>>> 
>>>
>>>
>>>
>>> I still use ParNew + CMS over G1GC with Java 8.  I haven't done a
>>> comparison with JDK 11 yet, so I'm not sure if it's any better.  I've heard
>>> it is, but I like to verify first.  The pause times with ParNew + CMS are
>>> generally lower than G1 when tuned right, but as Chris said it can be
>>> tricky.  If you aren't willing to spend the time understanding how it
>>> works and why each setting matters, G1 is a better option.

Re: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Jon Haddad
tlp-stress comes with workloads pre-baked, so there's not much
configuration to do.  The main flags you'll want are going to be:

-d : duration, I highly recommend running your test for a few days
--compaction
--compression
-p: number of partitions
-r: % of reads, 0-1

For example, you might run:

tlp-stress run KeyValue -d 24h --compaction lcs -p 10m -r .9

for a basic key value table, running for 24 hours, using LCS, 10 million
partitions, 90% reads.

There's a lot of options. I won't list them all here, it's why I wrote the
manual :)

Jon


On Mon, Oct 21, 2019 at 1:16 PM Sergio  wrote:

> Thanks, guys!
> I just copied and pasted what I found on our test machines but I can
> confirm that we have the same settings except for 8GB in production.
> I didn't select these settings and I need to verify why these settings are
> there.
> If any of you want to share your flags for a read-heavy workload it would
> be appreciated, so I would replace and test those flags with TLP-STRESS.
> I am thinking about different approaches (G1GC vs ParNew + CMS)
> How many GB for RAM do you dedicate to the OS in percentage or in an exact
> number?
> Can you share the flags for ParNew + CMS that I can play with it and
> perform a test?
>
> Best,
> Sergio
>
>
> Il giorno lun 21 ott 2019 alle ore 09:27 Reid Pinchback <
> rpinchb...@tripadvisor.com> ha scritto:
>
>> Since the instance size is < 32gb, hopefully swap isn’t being used, so it
>> should be moot.
>>
>>
>>
>> Sergio, also be aware that  -XX:+CMSClassUnloadingEnabled probably
>> doesn’t do anything for you.  I believe that only applies to CMS, not
>> G1GC.  I also wouldn’t take it as gospel truth that  -XX:+UseNUMA is a good
>> thing on AWS (or anything virtualized), you’d have to run your own tests
>> and find out.
>>
>>
>>
>> R
>>
>> *From: *Jon Haddad 
>> *Reply-To: *"user@cassandra.apache.org" 
>> *Date: *Monday, October 21, 2019 at 12:06 PM
>> *To: *"user@cassandra.apache.org" 
>> *Subject: *Re: [EXTERNAL] Re: GC Tuning
>> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>>
>>
>>
>> *Message from External Sender*
>>
>> One thing to note, if you're going to use a big heap, cap it at 31GB, not
>> 32.  Once you go to 32GB, you don't get to use compressed pointers [1], so
>> you get less addressable space than at 31GB.
>>
>>
>>
>> [1]
>> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
>> 
>>
>>
>>
>> On Mon, Oct 21, 2019 at 11:39 AM Durity, Sean R <
>> sean_r_dur...@homedepot.com> wrote:
>>
>> I don’t disagree with Jon, who has all kinds of performance tuning
>> experience. But for ease of operation, we only use G1GC (on Java 8),
>> because the tuning of ParNew+CMS requires a high degree of knowledge and
>> very repeatable testing harnesses. It isn’t worth our time. As a previous
>> writer mentioned, there is usually better return on our time tuning the
>> schema (aka helping developers understand Cassandra’s strengths).
>>
>>
>>
>> We use 16 – 32 GB heaps, nothing smaller than that.
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> *From:* Jon Haddad 
>> *Sent:* Monday, October 21, 2019 10:43 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* [EXTERNAL] Re: GC Tuning
>> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>> 
>>
>>
>>
>> I still use ParNew + CMS over G1GC with Java 8.  I haven't done a
>> comparison with JDK 11 yet, so I'm not sure if it's any better.  I've heard
>> it is, but I like to verify first.  The pause times with ParNew + CMS are
>> generally lower than G1 when tuned right, but as Chris said it can be
>> tricky.  If you aren't willing to spend the time understanding how it works
>> and why each setting matters, G1 is a better option.
>>
>>
>>
>> I wouldn't run Cassandra in production on less than 8GB of heap - I
>> consider it the absolute minimum.  For G1 I'd use 16GB, and never 4GB with
>> Cassandra unless you're rarely querying it.
>>
>>
>>
>> I typically use the following as a starting point now:
>>
>>
>>
>> ParNew + CMS
>>
>> 16GB heap
>>
>> 10GB new gen
>>
>> 2GB memtable cap, otherwise you'll spend a bunch of time copying around
>> memtables (cassandra.yaml)
>>
>> Max tenuring threshold: 2
>>
>> survivor ratio 6
>>
>>
>>
>> I've also done some tests with a 30GB heap, 24 GB of which was new gen.
>> This worked surprisingly well in my tests since it essentially keeps
>> everything out of the old gen.
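The quoted ParNew + CMS starting point maps onto JVM options roughly like
this — a hedged sketch for Java 8 only, translating the bullets above;
the 2GB memtable cap lives in cassandra.yaml, not in the JVM flags:

```
-Xms16G
-Xmx16G
-Xmn10G
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:MaxTenuringThreshold=2
-XX:SurvivorRatio=6
```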

Re: Cassandra Recommended System Settings

2019-10-21 Thread Elliott Sims
The TCP settings are basically "how much RAM to use to buffer data for TCP
sessions, per session", which translates roughly to maximum TCP window
size.  You can actually calculate approximately what you need by just
multiplying bandwidth and latency (10,000,000,000 bps * .0001 s * 1 byte/8
bits = 125KB buffer needed to fill the pipe).  In practice, I'd double or
triple the max setting vs the calculated value.  The suggested value from
Datastax is 16MB, which doesn't seem like a lot, but if you have 1,000
connections that could lead to up to 16GB of RAM being dedicated to TCP
buffers.

As an example, my traffic in and out of Cassandra is within a local 10Gb
network.  I use "4096 87380 6291456", but that's not particularly
highly-tuned for Cassandra specifically (that is, it's a value also used by
hosts that talk to the outside internet with much higher latency).
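The bandwidth-delay arithmetic above can be sketched like this; the
numbers are illustrative, not tuning advice:

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    # Bandwidth-delay product: bytes that must be in flight to keep the
    # pipe full for one TCP connection.
    return bandwidth_bps * rtt_s / 8  # bits -> bytes

# 10 Gbps link with 0.1 ms of latency -> the 125KB figure quoted above.
assert round(bdp_bytes(10_000_000_000, 0.0001)) == 125_000

# Why a 16MB per-socket maximum adds up: 1,000 sockets at the cap could
# pin roughly 16GB of RAM in TCP buffers.
assert 1_000 * 16_000_000 == 16_000_000_000
```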

On Mon, Oct 21, 2019 at 1:53 PM Sergio  wrote:

> Thanks Elliott!
>
> How do you know if there is too much RAM used for those settings?
>
> Which metrics do you keep track of?
>
> What would you recommend instead?
>
> Best,
>
> Sergio
>
> On Mon, Oct 21, 2019, 1:41 PM Elliott Sims  wrote:
>
>> Based on my experiences, if you have a new enough kernel I'd strongly
>> suggest switching the TCP scheduler algorithm to BBR.  I've found the rest
>> tend to be extremely sensitive to even small amounts of packet loss among
>> cluster members where BBR holds up well.
>>
>> High ulimits for basically everything are probably a good idea, although
>> "unlimited" may not be purely optimal for all cases.
>> The TCP keepalive settings are probably only necessary for traffic
>> traversing buggy/misconfigured firewalls, but shouldn't really do any harm on a modern
>> fast network.
>>
>> The TCP memory settings are pretty aggressive and probably result in
>> unnecessary RAM usage.
>> The net.core.rmem_default/net.core.wmem_default settings are overridden
>> by the TCP-specific settings as far as I know, so they're not really
>> relevant/helpful for Cassandra
>> The net.ipv4.tcp_rmem/net.ipv4.tcp_wmem max settings are pretty
>> aggressive.  That works out to something like 1Gbps with 130ms latency per
>> TCP connection, but on a local LAN with latencies <1ms it's enough buffer
>> for over 100Gbps per TCP session.  A much smaller value will probably make
>> more sense for most setups.
>>
>>
>> On Mon, Oct 21, 2019 at 10:21 AM Sergio 
>> wrote:
>>
>>>
>>> Hello!
>>>
>>> This is the kernel that I am using
>>> Linux  4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018
>>> x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> Best,
>>>
>>> Sergio
>>>
>>> Il giorno lun 21 ott 2019 alle ore 07:30 Reid Pinchback <
>>> rpinchb...@tripadvisor.com> ha scritto:
>>>
 I don't know which distro and version you are using, but watch out for
 surprises in what vm.swappiness=0 means.  In older kernels it means "only
 use swap when desperate".  I believe that newer kernels changed to have 1
 mean that, and 0 means to always use the oomkiller.  Neither situation is
 strictly good or bad, what matters is what you intend the system behavior
 to be in comparison with whatever monitoring/alerting you have put in 
 place.

 R


 On 10/18/19, 9:04 PM, "Sergio Bilello" 
 wrote:

  Message from External Sender

 Hello everyone!



 Do you have any setting that you would change or tweak from the
 below list?



 sudo cat /proc/4379/limits

 Limit Soft Limit   Hard Limit
  Units

 Max cpu time  unlimitedunlimited
 seconds

 Max file size unlimitedunlimited
 bytes

 Max data size unlimitedunlimited
 bytes

 Max stack sizeunlimitedunlimited
 bytes

 Max core file sizeunlimitedunlimited
 bytes

 Max resident set  unlimitedunlimited
 bytes

 Max processes 3276832768
 processes

 Max open files1048576  1048576
 files

 Max locked memory unlimitedunlimited
 bytes

 Max address space unlimitedunlimited
 bytes

 Max file locksunlimitedunlimited
 locks

 Max pending signals   unlimitedunlimited
 signals

 Max msgqueue size unlimitedunlimited
 bytes

 Max nice priority 00

 Max realtime priority 00

 Max realtime timeout  unlimitedunlimited
 us



 These are the sysctl settings

 default['cassandra']['sysctl'] = {

 

Re: Cassandra Recommended System Settings

2019-10-21 Thread Reid Pinchback
Sergio, if you do some online searching about ‘bufferbloat’ in networking, 
you’ll find the background to help explain what motivates networking changes.  
Actual investigation of network performance can get a bit gnarly.  The TL;DR 
summary is that big buffers function like big queues, and thus attempts to 
speed up throughput can cause things stuck in a queue to have higher latency.  
With very fast networks, there isn’t as much need to have big buffers.  Imagine 
having a coordinator node waiting to respond to a query but can’t because a 
bunch of Merkle trees are sitting in the TCP buffer waiting to be sent out.
Sometimes total latency doesn’t fairly measure actual effort to do the work, 
some of that can be time spent sitting waiting in the buffer to be shipped out 
back to the client.
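The queueing-delay point can be made concrete with a rough sketch; the
16 MB figure is borrowed from the buffer sizes discussed earlier in the
thread, not measured anywhere:

```python
def queue_delay_ms(buffered_bytes, bandwidth_bps):
    # Time for bytes already parked in a send buffer to drain, i.e. the
    # extra latency imposed on anything queued behind them.
    return buffered_bytes * 8 / bandwidth_bps * 1000

# 16 MB of Merkle-tree traffic sitting in the buffer of a 10 Gbps link
# holds up a small response queued behind it by about 12.8 ms.
assert round(queue_delay_ms(16_000_000, 10_000_000_000), 1) == 12.8
```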

From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, October 21, 2019 at 4:54 PM
To: "user@cassandra.apache.org" 
Subject: Re: Cassandra Recommended System Settings

Message from External Sender
Thanks Elliott!

How do you know if there is too much RAM used for those settings?

Which metrics do you keep track of?

What would you recommend instead?

Best,

Sergio

On Mon, Oct 21, 2019, 1:41 PM Elliott Sims 
mailto:elli...@backblaze.com>> wrote:
Based on my experiences, if you have a new enough kernel I'd strongly suggest 
switching the TCP scheduler algorithm to BBR.  I've found the rest tend to be 
extremely sensitive to even small amounts of packet loss among cluster members 
where BBR holds up well.
High ulimits for basically everything are probably a good idea, although 
"unlimited" may not be purely optimal for all cases.
The TCP keepalive settings are probably only necessary for traffic 
traversing buggy/misconfigured firewalls, but shouldn't really do any harm on a modern
fast network.
The TCP memory settings are pretty aggressive and probably result in 
unnecessary RAM usage.
The net.core.rmem_default/net.core.wmem_default settings are overridden by the 
TCP-specific settings as far as I know, so they're not really relevant/helpful 
for Cassandra
The net.ipv4.tcp_rmem/net.ipv4.tcp_wmem max settings are pretty aggressive.  
That works out to something like 1Gbps with 130ms latency per TCP connection, 
but on a local LAN with latencies <1ms it's enough buffer for over 100Gbps per 
TCP session.  A much smaller value will probably make more sense for most 
setups.


On Mon, Oct 21, 2019 at 10:21 AM Sergio 
mailto:lapostadiser...@gmail.com>> wrote:

Hello!

This is the kernel that I am using
Linux  4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018 x86_64 
x86_64 x86_64 GNU/Linux

Best,

Sergio

Il giorno lun 21 ott 2019 alle ore 07:30 Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> ha scritto:
I don't know which distro and version you are using, but watch out for 
surprises in what vm.swappiness=0 means.  In older kernels it means "only use 
swap when desperate".  I believe that newer kernels changed to have 1 mean 
that, and 0 means to always use the oomkiller.  Neither situation is strictly 
good or bad, what matters is what you intend the system behavior to be in 
comparison with whatever monitoring/alerting you have put in place.

R


On 10/18/19, 9:04 PM, "Sergio Bilello" 
mailto:lapostadiser...@gmail.com>> wrote:

 Message from External Sender

Hello everyone!



Do you have any setting that you would change or tweak from the below list?



sudo cat /proc/4379/limits

Limit Soft Limit   Hard Limit   Units

Max cpu time  unlimitedunlimitedseconds

Max file size unlimitedunlimitedbytes

Max data size unlimitedunlimitedbytes

Max stack sizeunlimitedunlimitedbytes

Max core file sizeunlimitedunlimitedbytes

Max resident set  unlimitedunlimitedbytes

Max processes 3276832768
processes

Max open files1048576  1048576  files

Max locked memory unlimitedunlimitedbytes

Max address space unlimitedunlimitedbytes

Max file locksunlimitedunlimitedlocks

Max pending signals   unlimitedunlimitedsignals

Max msgqueue size unlimitedunlimitedbytes

Max nice priority 00

Max realtime priority 00

Max realtime timeout  unlimitedunlimitedus



These are the sysctl settings

default['cassandra']['sysctl'] = {

'net.ipv4.tcp_keepalive_time' => 60,

'net.ipv4.tcp_keepalive_probes' => 3,

'net.ipv4.tcp_keepalive_intvl' => 10,


Re: Cassandra Recommended System Settings

2019-10-21 Thread Sergio
Thanks Elliott!

How do you know if there is too much RAM used for those settings?

Which metrics do you keep track of?

What would you recommend instead?

Best,

Sergio

On Mon, Oct 21, 2019, 1:41 PM Elliott Sims  wrote:

> Based on my experiences, if you have a new enough kernel I'd strongly
> suggest switching the TCP scheduler algorithm to BBR.  I've found the rest
> tend to be extremely sensitive to even small amounts of packet loss among
> cluster members where BBR holds up well.
>
> High ulimits for basically everything are probably a good idea, although
> "unlimited" may not be purely optimal for all cases.
> The TCP keepalive settings are probably only necessary for traffic
> traversing buggy/misconfigured firewalls, but shouldn't really do any harm on a modern
> fast network.
>
> The TCP memory settings are pretty aggressive and probably result in
> unnecessary RAM usage.
> The net.core.rmem_default/net.core.wmem_default settings are overridden by
> the TCP-specific settings as far as I know, so they're not really
> relevant/helpful for Cassandra
> The net.ipv4.tcp_rmem/net.ipv4.tcp_wmem max settings are pretty
> aggressive.  That works out to something like 1Gbps with 130ms latency per
> TCP connection, but on a local LAN with latencies <1ms it's enough buffer
> for over 100Gbps per TCP session.  A much smaller value will probably make
> more sense for most setups.
>
>
> On Mon, Oct 21, 2019 at 10:21 AM Sergio  wrote:
>
>>
>> Hello!
>>
>> This is the kernel that I am using
>> Linux  4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018
>> x86_64 x86_64 x86_64 GNU/Linux
>>
>> Best,
>>
>> Sergio
>>
>> Il giorno lun 21 ott 2019 alle ore 07:30 Reid Pinchback <
>> rpinchb...@tripadvisor.com> ha scritto:
>>
>>> I don't know which distro and version you are using, but watch out for
>>> surprises in what vm.swappiness=0 means.  In older kernels it means "only
>>> use swap when desperate".  I believe that newer kernels changed to have 1
>>> mean that, and 0 means to always use the oomkiller.  Neither situation is
>>> strictly good or bad, what matters is what you intend the system behavior
>>> to be in comparison with whatever monitoring/alerting you have put in place.
>>>
>>> R
>>>
>>>
>>> On 10/18/19, 9:04 PM, "Sergio Bilello" 
>>> wrote:
>>>
>>>  Message from External Sender
>>>
>>> Hello everyone!
>>>
>>>
>>>
>>> Do you have any setting that you would change or tweak from the
>>> below list?
>>>
>>>
>>>
>>> sudo cat /proc/4379/limits
>>>
>>> Limit Soft Limit   Hard Limit
>>>  Units
>>>
>>> Max cpu time  unlimitedunlimited
>>> seconds
>>>
>>> Max file size unlimitedunlimited
>>> bytes
>>>
>>> Max data size unlimitedunlimited
>>> bytes
>>>
>>> Max stack sizeunlimitedunlimited
>>> bytes
>>>
>>> Max core file sizeunlimitedunlimited
>>> bytes
>>>
>>> Max resident set  unlimitedunlimited
>>> bytes
>>>
>>> Max processes 3276832768
>>> processes
>>>
>>> Max open files1048576  1048576
>>> files
>>>
>>> Max locked memory unlimitedunlimited
>>> bytes
>>>
>>> Max address space unlimitedunlimited
>>> bytes
>>>
>>> Max file locksunlimitedunlimited
>>> locks
>>>
>>> Max pending signals   unlimitedunlimited
>>> signals
>>>
>>> Max msgqueue size unlimitedunlimited
>>> bytes
>>>
>>> Max nice priority 00
>>>
>>> Max realtime priority 00
>>>
>>> Max realtime timeout  unlimitedunlimited
>>> us
>>>
>>>
>>>
>>> These are the sysctl settings
>>>
>>> default['cassandra']['sysctl'] = {
>>>
>>> 'net.ipv4.tcp_keepalive_time' => 60,
>>>
>>> 'net.ipv4.tcp_keepalive_probes' => 3,
>>>
>>> 'net.ipv4.tcp_keepalive_intvl' => 10,
>>>
>>> 'net.core.rmem_max' => 16777216,
>>>
>>> 'net.core.wmem_max' => 16777216,
>>>
>>> 'net.core.rmem_default' => 16777216,
>>>
>>> 'net.core.wmem_default' => 16777216,
>>>
>>> 'net.core.optmem_max' => 40960,
>>>
>>> 'net.ipv4.tcp_rmem' => '4096 87380 16777216',
>>>
>>> 'net.ipv4.tcp_wmem' => '4096 65536 16777216',
>>>
>>> 'net.ipv4.ip_local_port_range' => '1 65535',
>>>
>>> 'net.ipv4.tcp_window_scaling' => 1,
>>>
>>>'net.core.netdev_max_backlog' => 2500,
>>>
>>>'net.core.somaxconn' => 65000,
>>>
>>> 'vm.max_map_count' => 1048575,
>>>
>>> 'vm.swappiness' => 0
>>>
>>> }
>>>
>>>
>>>
>>> Am I missing something else?
>>>
>>>
>>>
>>> Do you have any experience to configure CENTOS 7
>>>
>>> for
>>>
>>> JAVA HUGE PAGES
>>>
>>>
>>> 

Re: Cassandra Recommended System Settings

2019-10-21 Thread Elliott Sims
Based on my experiences, if you have a new enough kernel I'd strongly
suggest switching the TCP scheduler algorithm to BBR.  I've found the rest
tend to be extremely sensitive to even small amounts of packet loss among
cluster members where BBR holds up well.

High ulimits for basically everything are probably a good idea, although
"unlimited" may not be purely optimal for all cases.
The TCP keepalive settings are probably only necessary for traffic
traversing buggy/misconfigured firewalls, but shouldn't really do any harm on a modern
fast network.

The TCP memory settings are pretty aggressive and probably result in
unnecessary RAM usage.
The net.core.rmem_default/net.core.wmem_default settings are overridden by
the TCP-specific settings as far as I know, so they're not really
relevant/helpful for Cassandra
The net.ipv4.tcp_rmem/net.ipv4.tcp_wmem max settings are pretty
aggressive.  That works out to something like 1Gbps with 130ms latency per
TCP connection, but on a local LAN with latencies <1ms it's enough buffer
for over 100Gbps per TCP session.  A much smaller value will probably make
more sense for most setups.


On Mon, Oct 21, 2019 at 10:21 AM Sergio  wrote:

>
> Hello!
>
> This is the kernel that I am using
> Linux  4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018
> x86_64 x86_64 x86_64 GNU/Linux
>
> Best,
>
> Sergio
>
> Il giorno lun 21 ott 2019 alle ore 07:30 Reid Pinchback <
> rpinchb...@tripadvisor.com> ha scritto:
>
>> I don't know which distro and version you are using, but watch out for
>> surprises in what vm.swappiness=0 means.  In older kernels it means "only
>> use swap when desperate".  I believe that newer kernels changed to have 1
>> mean that, and 0 means to always use the oomkiller.  Neither situation is
>> strictly good or bad, what matters is what you intend the system behavior
>> to be in comparison with whatever monitoring/alerting you have put in place.
>>
>> R
>>
>>
>> On 10/18/19, 9:04 PM, "Sergio Bilello" 
>> wrote:
>>
>>  Message from External Sender
>>
>> Hello everyone!
>>
>>
>>
>> Do you have any setting that you would change or tweak from the below
>> list?
>>
>>
>>
>> sudo cat /proc/4379/limits
>>
>> Limit Soft Limit   Hard Limit
>>  Units
>>
>> Max cpu time  unlimitedunlimited
>> seconds
>>
>> Max file size unlimitedunlimited
>> bytes
>>
>> Max data size unlimitedunlimited
>> bytes
>>
>> Max stack sizeunlimitedunlimited
>> bytes
>>
>> Max core file sizeunlimitedunlimited
>> bytes
>>
>> Max resident set  unlimitedunlimited
>> bytes
>>
>> Max processes 3276832768
>> processes
>>
>> Max open files1048576  1048576
>> files
>>
>> Max locked memory unlimitedunlimited
>> bytes
>>
>> Max address space unlimitedunlimited
>> bytes
>>
>> Max file locksunlimitedunlimited
>> locks
>>
>> Max pending signals   unlimitedunlimited
>> signals
>>
>> Max msgqueue size unlimitedunlimited
>> bytes
>>
>> Max nice priority 00
>>
>> Max realtime priority 00
>>
>> Max realtime timeout  unlimitedunlimitedus
>>
>>
>>
>> These are the sysctl settings
>>
>> default['cassandra']['sysctl'] = {
>>
>> 'net.ipv4.tcp_keepalive_time' => 60,
>>
>> 'net.ipv4.tcp_keepalive_probes' => 3,
>>
>> 'net.ipv4.tcp_keepalive_intvl' => 10,
>>
>> 'net.core.rmem_max' => 16777216,
>>
>> 'net.core.wmem_max' => 16777216,
>>
>> 'net.core.rmem_default' => 16777216,
>>
>> 'net.core.wmem_default' => 16777216,
>>
>> 'net.core.optmem_max' => 40960,
>>
>> 'net.ipv4.tcp_rmem' => '4096 87380 16777216',
>>
>> 'net.ipv4.tcp_wmem' => '4096 65536 16777216',
>>
>> 'net.ipv4.ip_local_port_range' => '1 65535',
>>
>> 'net.ipv4.tcp_window_scaling' => 1,
>>
>>'net.core.netdev_max_backlog' => 2500,
>>
>>'net.core.somaxconn' => 65000,
>>
>> 'vm.max_map_count' => 1048575,
>>
>> 'vm.swappiness' => 0
>>
>> }
>>
>>
>>
>> Am I missing something else?
>>
>>
>>
>> Do you have any experience to configure CENTOS 7
>>
>> for
>>
>> JAVA HUGE PAGES
>>
>>
>> https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html#CheckJavaHugepagessettings
>>
>>
>>
>> OPTIMIZE SSD
>>
>>
>> 

Re: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Reid Pinchback
Think of GB to OS as something intended to support file caching.  As such the 
amount is whatever suits your usage.  If your use is almost exclusively 
reading, then file cache memory doesn’t matter that much if you’re operating 
with your storage as those nvme ssd drives that the i3’s come with.  There is 
already a chunk cache that you should be tuning in C* instead, and feeding fast 
from the O/S file cache, assuming compressed SSTables, maybe turns out to be 
less of a concern.

If you have moderate write activity then your situation changes because then 
that same file cache is how your dirty background pages turn into eventual 
flushes to disk, and so you have to watch the impact of read stalls when the 
I/O fills with write requests.  You might not see this so obviously on nvme 
drives, but that could depend a lot on the distro and kernels and how the 
filesystem is mounted.

My super strong advice on issues like this is to not cargo-cult other people’s 
tunings.  Look at them for ideas, sure. But learn how to do your own 
investigations, and budget the time for it into your project.  Budget a LOT of 
time for it if your measure of “good performance” is based on latency; when 
“good” is defined in terms of throughput your life is easier.  Also, everything 
is always a little different in virtualization, and lord knows you can have 
screwball things appear in AWS. The good news is you don’t need a perfect 
configuration out of the gate; you need a configuration you understand and can 
refine; understanding comes from knowing how to do your own performance 
monitoring.


From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, October 21, 2019 at 1:16 PM
To: "user@cassandra.apache.org" 
Subject: Re: [EXTERNAL] Re: GC Tuning 
https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

Message from External Sender
Thanks, guys!
I just copied and pasted what I found on our test machines but I can confirm
that we have the same settings except for 8GB in production.
I didn't select these settings and I need to verify why these settings are 
there.
If any of you want to share your flags for a read-heavy workload it would be 
appreciated, so I would replace and test those flags with TLP-STRESS.
I am thinking about different approaches (G1GC vs ParNew + CMS)
How many GB for RAM do you dedicate to the OS in percentage or in an exact 
number?
Can you share the flags for ParNew + CMS that I can play with it and perform a 
test?

Best,
Sergio

Il giorno lun 21 ott 2019 alle ore 09:27 Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> ha scritto:
Since the instance size is < 32gb, hopefully swap isn’t being used, so it 
should be moot.

Sergio, also be aware that  -XX:+CMSClassUnloadingEnabled probably doesn’t do 
anything for you.  I believe that only applies to CMS, not G1GC.  I also 
wouldn’t take it as gospel truth that  -XX:+UseNUMA is a good thing on AWS (or 
anything virtualized), you’d have to run your own tests and find out.

R
From: Jon Haddad mailto:j...@jonhaddad.com>>
Reply-To: "user@cassandra.apache.org" 
mailto:user@cassandra.apache.org>>
Date: Monday, October 21, 2019 at 12:06 PM
To: "user@cassandra.apache.org" 
mailto:user@cassandra.apache.org>>
Subject: Re: [EXTERNAL] Re: GC Tuning 
https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

Message from External Sender
One thing to note, if you're going to use a big heap, cap it at 31GB, not 32.  
Once you go to 32GB, you don't get to use compressed pointers [1], so you get 
less addressable space than at 31GB.

[1] 
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/

On Mon, Oct 21, 2019 at 11:39 AM Durity, Sean R 
mailto:sean_r_dur...@homedepot.com>> wrote:
I don’t disagree with Jon, who has all kinds of performance tuning experience. 
But for ease of operation, we only use G1GC (on Java 8), because the tuning of 
ParNew+CMS requires a high degree of knowledge and very repeatable testing 
harnesses. It isn’t worth our time. As a previous writer mentioned, there is 
usually better return on our time tuning the schema (aka helping developers 
understand Cassandra’s strengths).

We use 16 – 32 GB heaps, nothing smaller than that.

Sean Durity

From: Jon Haddad mailto:j...@jonhaddad.com>>
Sent: Monday, October 21, 2019 10:43 AM
To: 

Re: Cassandra Recommended System Settings

2019-10-21 Thread Sergio
Hello!

This is the kernel that I am using
Linux  4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018
x86_64 x86_64 x86_64 GNU/Linux

Best,

Sergio

On Mon, Oct 21, 2019 at 07:30 Reid Pinchback <rpinchb...@tripadvisor.com> wrote:



Re: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Sergio
Thanks, guys!
I just copied and pasted what I found on our test machines, but I can confirm
we have the same settings in production, except for an 8GB heap.
I didn't select these settings, and I need to verify why they are there.
If any of you want to share your flags for a read-heavy workload, it would
be appreciated; I would swap those flags in and test them with tlp-stress.
I am weighing two approaches (G1GC vs. ParNew + CMS).
How much RAM do you dedicate to the OS, as a percentage or an exact number?
Can you share ParNew + CMS flags that I can experiment with and benchmark?

Best,
Sergio


On Mon, Oct 21, 2019 at 09:27 Reid Pinchback <rpinchb...@tripadvisor.com> wrote:


Re: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Reid Pinchback
Since the instance size is < 32gb, hopefully swap isn’t being used, so it 
should be moot.

Sergio, also be aware that  -XX:+CMSClassUnloadingEnabled probably doesn’t do 
anything for you.  I believe that only applies to CMS, not G1GC.  I also 
wouldn’t take it as gospel truth that  -XX:+UseNUMA is a good thing on AWS (or 
anything virtualized), you’d have to run your own tests and find out.

R


Re: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Jon Haddad
One thing to note, if you're going to use a big heap, cap it at 31GB, not
32.  Once you go to 32GB, you don't get to use compressed pointers [1], so
you get less addressable space than at 31GB.

[1]
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
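A quick way to see where that cutoff lands on your own JVM — this is a generic HotSpot check, not something from the thread, and it assumes a local 64-bit JDK 8+ is installed:

```shell
# Ask HotSpot whether compressed oops survive at a given heap size.
# -XX:+PrintFlagsFinal dumps the resolved flag values; grep the one we care about.
check_oops() {
  java -Xmx"$1" -XX:+PrintFlagsFinal -version 2>/dev/null | grep -w UseCompressedOops
}
check_oops 31g   # normally still reports true
check_oops 32g   # normally flips to false
```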


RE: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Durity, Sean R
I don’t disagree with Jon, who has all kinds of performance tuning experience. 
But for ease of operation, we only use G1GC (on Java 8), because the tuning of 
ParNew+CMS requires a high degree of knowledge and very repeatable testing 
harnesses. It isn’t worth our time. As a previous writer mentioned, there is 
usually better return on our time tuning the schema (aka helping developers 
understand Cassandra’s strengths).

We use 16 – 32 GB heaps, nothing smaller than that.

Sean Durity


Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Jon Haddad
I still use ParNew + CMS over G1GC with Java 8.  I haven't done a
comparison with JDK 11 yet, so I'm not sure if it's any better.  I've heard
it is, but I like to verify first.  The pause times with ParNew + CMS are
generally lower than G1 when tuned right, but as Chris said it can be
tricky.  If you aren't willing to spend the time understanding how it works
and why each setting matters, G1 is a better option.

I wouldn't run Cassandra in production on less than 8GB of heap - I
consider it the absolute minimum.  For G1 I'd use 16GB, and never 4GB with
Cassandra unless you're rarely querying it.

I typically use the following as a starting point now:

ParNew + CMS
16GB heap
10GB new gen
2GB memtable cap, otherwise you'll spend a bunch of time copying around
memtables (cassandra.yaml)
Max tenuring threshold: 2
survivor ratio 6
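Spelled out as JVM flags, that starting point looks roughly like the fragment below. This is a sketch, not config from the thread: the flag spellings are standard HotSpot options, but using `-Xmn` for the new gen and `memtable_heap_space_in_mb` for the memtable cap is my mapping of the bullets — verify against your Cassandra version's jvm.options / cassandra-env.sh.

```shell
# Hypothetical cassandra-env.sh fragment mapping the bullets above to flags.
JVM_OPTS="$JVM_OPTS -Xms16G -Xmx16G"                           # 16GB heap, min == max
JVM_OPTS="$JVM_OPTS -Xmn10G"                                   # 10GB new gen
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"  # ParNew + CMS
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=2"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=6"
# The 2GB memtable cap lives in cassandra.yaml, e.g.:
#   memtable_heap_space_in_mb: 2048
```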

I've also done some tests with a 30GB heap, 24 GB of which was new gen.
This worked surprisingly well in my tests since it essentially keeps
everything out of the old gen.  New gen allocations are just a pointer bump
and are pretty fast, so in my (limited) tests of this I was seeing really
good p99 times.  I was seeing a 200-400 ms pause roughly once a minute
running a workload that deliberately wasn't hitting a resource limit
(testing real world looking stress vs overwhelming the cluster).

We built tlp-cluster [1] and tlp-stress [2] to help figure these things
out.

[1] https://thelastpickle.com/tlp-cluster/
[2] http://thelastpickle.com/tlp-stress

Jon





Re: Cassandra Recommended System Settings

2019-10-21 Thread Reid Pinchback
I don't know which distro and version you are using, but watch out for
surprises in what vm.swappiness=0 means.  In older kernels it means "only use
swap when desperate".  I believe newer kernels changed that: 1 now means
"only when desperate", and 0 means the kernel will invoke the OOM killer rather
than swap at all.  Neither situation is strictly good or bad; what matters is
whether the system behavior you intend matches whatever monitoring/alerting
you have put in place.

R
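One way to see what a given box is actually set to — these are generic Linux commands; they show the configured value and whether swap exists at all, not what the kernel's semantics for that value are, which still depends on kernel version:

```shell
# Current swappiness value and whether any swap is configured at all.
cat /proc/sys/vm/swappiness
grep -E '^Swap(Total|Free)' /proc/meminfo
```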


On 10/18/19, 9:04 PM, "Sergio Bilello"  wrote:

 Message from External Sender

Hello everyone!



Do you have any setting that you would change or tweak from the below list?



sudo cat /proc/4379/limits

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             32768                32768                processes
Max open files            1048576              1048576              files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       unlimited            unlimited            signals
Max msgqueue size         unlimited            unlimited            bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
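Limits like those are usually pinned via a limits.d fragment. The sketch below writes to a scratch path so it is runnable as-is; the file name and the `cassandra` user are assumptions, and on a real node it would live under /etc/security/limits.d/ (or as LimitNOFILE etc. in the systemd unit).

```shell
# Hypothetical limits.d fragment matching the soft/hard limits shown above.
f="${TMPDIR:-/tmp}/cassandra-limits.conf"
cat > "$f" <<'EOF'
cassandra  -  memlock  unlimited
cassandra  -  nofile   1048576
cassandra  -  nproc    32768
cassandra  -  as       unlimited
EOF
cat "$f"
```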



These are the sysctl settings

default['cassandra']['sysctl'] = {
  'net.ipv4.tcp_keepalive_time' => 60,
  'net.ipv4.tcp_keepalive_probes' => 3,
  'net.ipv4.tcp_keepalive_intvl' => 10,
  'net.core.rmem_max' => 16777216,
  'net.core.wmem_max' => 16777216,
  'net.core.rmem_default' => 16777216,
  'net.core.wmem_default' => 16777216,
  'net.core.optmem_max' => 40960,
  'net.ipv4.tcp_rmem' => '4096 87380 16777216',
  'net.ipv4.tcp_wmem' => '4096 65536 16777216',
  'net.ipv4.ip_local_port_range' => '1 65535',
  'net.ipv4.tcp_window_scaling' => 1,
  'net.core.netdev_max_backlog' => 2500,
  'net.core.somaxconn' => 65000,
  'vm.max_map_count' => 1048575,
  'vm.swappiness' => 0
}
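Rendered out of Chef, those attributes correspond to a sysctl.d file like the sketch below (written to a scratch path here so it runs without root; the file name under /etc/sysctl.d/ is an assumption). Note vm.swappiness = 0 — see the caveat earlier in the thread about what 0 means on newer kernels.

```shell
# Hypothetical /etc/sysctl.d/99-cassandra.conf mirroring the Chef hash above.
conf="${TMPDIR:-/tmp}/99-cassandra.conf"
cat > "$conf" <<'EOF'
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 10
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.ip_local_port_range = 1 65535
net.ipv4.tcp_window_scaling = 1
net.core.netdev_max_backlog = 2500
net.core.somaxconn = 65000
vm.max_map_count = 1048575
vm.swappiness = 0
EOF
# On a real node: sudo sysctl -p /etc/sysctl.d/99-cassandra.conf
```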



Am I missing something else?



Do you have any experience configuring CentOS 7 for:

JAVA HUGE PAGES
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html#CheckJavaHugepagessettings

OPTIMIZE SSDs
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html#OptimizeSSDs

https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html



We are using AWS i3.xlarge instances



Thanks,



Sergio



-

To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org

For additional commands, e-mail: user-h...@cassandra.apache.org







Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Reid Pinchback
An i3.xlarge has 30.5 GB of RAM but you're using less than 4 GB for C*.  So
minus room for other uses of JVM memory and for kernel activity, that's about
25 GB for file cache.  You'll have to decide whether you want a bigger heap to
allow for less frequent GC cycles, or whether you could save money on the instance
size.  C* generates a lot of medium-lifetime objects which can easily
end up in old gen.  A larger heap will reduce the burn of frequent old-gen
collections.  There are no magic numbers to just hand out, because it'll depend on
your usage patterns.
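That budget can be sanity-checked with quick arithmetic; the heap comes from the -Xmx3821M flag quoted below, while the ~2 GB allowance for JVM off-heap and kernel overhead is my assumption, not a number from the thread:

```shell
# i3.xlarge RAM, minus the ~3.8GB heap (-Xmx3821M), minus assumed overhead.
awk 'BEGIN { ram = 30.5; heap = 3821/1024; overhead = 2.0
             printf "~%.1f GB left for page cache\n", ram - heap - overhead }'
```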

From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Sunday, October 20, 2019 at 2:51 PM
To: "user@cassandra.apache.org" 
Subject: Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

Thanks for the answer.

This is the JVM version that I have right now.

openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)

These are the current flags. Would you change anything in a i3x.large aws node?

java -Xloggc:/var/log/cassandra/gc.log 
-Dcassandra.max_queued_native_transport_requests=4096 -ea 
-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 
-XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=103 
-XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB 
-XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true 
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:+UseG1GC 
-XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=200 
-XX:InitiatingHeapOccupancyPercent=45 -XX:G1HeapRegionSize=0 
-XX:-ParallelRefProcEnabled -Xms3821M -Xmx3821M 
-XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler 
-Dcom.sun.management.jmxremote.port=7199 
-Dcom.sun.management.jmxremote.rmi.port=7199 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.password.file=/etc/cassandra/conf/jmxremote.password
 
-Dcom.sun.management.jmxremote.access.file=/etc/cassandra/conf/jmxremote.access 
-Djava.library.path=/usr/share/cassandra/lib/sigar-bin 
-Djava.rmi.server.hostname=172.24.150.141 -XX:+CMSClassUnloadingEnabled 
-javaagent:/usr/share/cassandra/lib/jmx_prometheus_javaagent-0.3.1.jar=10100:/etc/cassandra/default.conf/jmx-export.yml
 -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra 
-Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid 
-Dcassandra-foreground=yes -cp