Re: High CPU usage on some of nodes

2015-09-11 Thread Roman Tkachenko
I have another datapoint from our monitoring system that shows huge
outbound network traffic increase for the affected boxes during these
spikes:

[image: Inline image 1]

Inbound traffic, on the other hand, is increased on the nodes other than
these three (purple, yellow and blue), so it does look like some kind of
excessive internode communication is going on between these 3 nodes and the
rest of the cluster.

What could these network spikes be a sign of?




Re: High CPU usage on some of nodes

2015-09-11 Thread Graham Sanderson
Again, I haven’t read this thread from the beginning, so I don’t know which
node is which, but if nodes pause for a longish GC, the other nodes will likely
be saving hints for them (assuming you are writing at the time), and those
hints will be delivered once the machines become responsive again. I’m just
guessing though. Take a look at the hinting metrics.





Re: High CPU usage on some of nodes

2015-09-11 Thread Otis Gospodnetić
A quick and dirty way is to run jstack a few times during a spike and see if
you can spot some common methods where the code is spending its time.
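
A rough sketch of how several jstack dumps could be compared, in case it
helps. It assumes the usual jstack text layout (a quoted thread name, a
"java.lang.Thread.State:" line, then "at ..." frames). The helper names
top_frames and hot_methods are made up for this example and are not part of
any tool.

```python
import re
from collections import Counter

# First line of a thread entry, e.g.  "SharedPool-Worker-3" #87 daemon prio=5 ...
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]*)"')
# A stack frame line, e.g.  "\tat org.example.Foo.bar(Foo.java:42)"
FRAME = re.compile(r'^\s+at (?P<method>[\w.$<>]+)\(')


def top_frames(dump_text, runnable_only=True):
    """Count the topmost stack frame of each thread in one jstack dump."""
    counts = Counter()
    in_thread = False  # inside a thread entry whose top frame is not yet seen
    runnable = False
    for line in dump_text.splitlines():
        if THREAD_HEADER.match(line):
            in_thread, runnable = True, False
            continue
        if not in_thread:
            continue
        if "java.lang.Thread.State:" in line:
            runnable = "RUNNABLE" in line
            continue
        match = FRAME.match(line)
        if match:
            if runnable or not runnable_only:
                counts[match.group("method")] += 1
            in_thread = False  # only the top frame of each thread is counted
    return counts


def hot_methods(dumps, n=10):
    """Merge the counts of several dumps and return the n most frequent methods."""
    total = Counter()
    for dump in dumps:
        total.update(top_frames(dump))
    return total.most_common(n)
```

Save a few dumps taken some seconds apart (for example "jstack <pid> >
dump1.txt"), read each file into a string and pass the list to hot_methods.
Methods that keep showing up at the top of RUNNABLE threads are the ones
worth a closer look.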

Otis




Re: High CPU usage on some of nodes

2015-09-10 Thread Robert Wille
It sounds like it's probably GC. Grep for GC in system.log to verify. If it is
GC, there are a myriad of issues that could cause it, but at least you’ve
narrowed it down.




Re: High CPU usage on some of nodes

2015-09-10 Thread Samuel CARRIERE
Hi Roman,
If it affects only a subset of nodes and it's always the same ones, it
could be a "problem" with your data model: maybe some (too) wide rows on
these nodes.
If one of your rows is too wide, the deserialisation of the column index
of this row can take a lot of resources (disk, RAM, and CPU).
If you are using leveled compaction strategy and you see abnormally big
sstables on those nodes, it could be a clue.
Regards,
Samuel



Re: High CPU usage on some of nodes

2015-09-10 Thread Roman Tkachenko
Thanks for the responses guys.

I also suspected GC and I guess it could be it, since during the spikes
logs are filled with messages like "GC for ConcurrentMarkSweep: 5908 ms for
1 collections, 1986282520 used; max is 8375238656", often right before
messages about dropped queries, unlike other, unaffected, nodes that only
have "GC for ParNew: 230 ms for 1 collections, 4418571760 used; max is
8375238656" type of messages.

Is my best shot to play with JVM settings trying to tune garbage collection
then?
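
For comparing affected and unaffected nodes, log lines like the ones quoted
above can be summarised with a short script. This is only a sketch: it
assumes every GC line has exactly the shape shown above ("GC for <collector>:
N ms for N collections, N used; max is N"), and the function names
parse_gc_lines and summarize are invented for the example.

```python
import re
from collections import defaultdict

# Matches the GC lines quoted above, wherever they occur in a log line.
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms for (?P<count>\d+) collections, "
    r"(?P<used>\d+) used; max is (?P<max>\d+)"
)


def parse_gc_lines(lines):
    """Return one dict per GC line found; other lines are ignored."""
    events = []
    for line in lines:
        match = GC_LINE.search(line)
        if match:
            events.append({
                "collector": match.group("collector"),
                "ms": int(match.group("ms")),
                "collections": int(match.group("count")),
                "used": int(match.group("used")),
                "max": int(match.group("max")),
            })
    return events


def summarize(events):
    """Per collector: number of log events, total pause and worst pause in ms."""
    summary = defaultdict(lambda: {"events": 0, "total_ms": 0, "max_ms": 0})
    for event in events:
        entry = summary[event["collector"]]
        entry["events"] += 1
        entry["total_ms"] += event["ms"]
        entry["max_ms"] = max(entry["max_ms"], event["ms"])
    return dict(summary)
```

Passing an open system.log file object to parse_gc_lines and the result to
summarize gives one entry per collector, which makes it easy to see which
nodes log long ConcurrentMarkSweep pauses and which only log ParNew.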




Re: High CPU usage on some of nodes

2015-09-10 Thread Jeff Jirsa
With a 5s collection, the problem is almost certainly GC. 

GC pressure can be caused by a number of things, including normal read/write 
loads, but ALSO compaction calculation (pre-2.1.9 / #9882) and very large 
partitions (trying to load a very large partition with something like row cache 
in 2.0 and earlier, or issuing a full row read where the row is larger than you 
expect). 

You can try to tune the GC behavior, but the underlying problem may be 
something like a bad data model (which Samuel suggested), and no amount of GC 
tuning is going to fix trying to do bad things with very big rows. 
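
One way to check for very large partitions is to scan saved "nodetool
cfstats" output. The sketch below assumes the output contains "Keyspace:",
"Table:" and "Compacted partition maximum bytes:" lines, which is what I
would expect from 2.1 (older versions label these fields differently, so
adjust the prefixes to match what you actually see). The function name and
the 100 MB threshold are arbitrary choices for illustration.

```python
def largest_partitions(cfstats_text, threshold_bytes=100 * 1024 * 1024):
    """Return (max_partition_bytes, keyspace, table) for tables at or above the threshold.

    Assumes "Keyspace:", "Table:" and "Compacted partition maximum bytes:"
    lines in the text; this layout is an assumption, not a guarantee.
    """
    keyspace = table = None
    hits = []
    for raw in cfstats_text.splitlines():
        line = raw.strip()
        if line.startswith("Keyspace:"):
            keyspace = line.split(":", 1)[1].strip()
        elif line.startswith("Table:"):
            table = line.split(":", 1)[1].strip()
        elif line.startswith("Compacted partition maximum bytes:"):
            size = int(line.split(":", 1)[1])
            if size >= threshold_bytes:
                hits.append((size, keyspace, table))
    # Largest partitions first.
    return sorted(hits, key=lambda hit: hit[0], reverse=True)
```

Running this against output from an affected node and from a healthy one
shows quickly whether the affected nodes hold unusually large partitions.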









Re: High CPU usage on some of nodes

2015-09-10 Thread Robert Coli
On Thu, Sep 10, 2015 at 10:54 AM, Roman Tkachenko 
wrote:
>
> [5 second CMS GC] Is my best shot to play with JVM settings trying to tune
> garbage collection then?
>

Yep. As a minor note, if the machines are that beefy, they probably have a
lot of RAM; you might wish to consider trying G1 GC and a larger heap.

=Rob


Re: High CPU usage on some of nodes

2015-09-10 Thread Graham Sanderson
Haven’t been following this thread, but we run beefy machines with 8 GB new
gen and 12 GB old gen (down from 16 GB since moving memtables off heap; we can
probably go lower)…

Apart from making sure you have all the latest -XX: flags from cassandra-env.sh 
(and MALLOC_ARENA_MAX), I personally would recommend running latest 2.1.x with

memory_allocator: JEMallocAllocator
memtable_allocation_type: offheap_objects

Some people will probably disagree, but it works great for us (the rare long
pauses are under 2 seconds), and if you’re seeing slow GC because of promotion
failure of objects 131074 dwords big, then I definitely suggest you give it a
try.






High CPU usage on some of nodes

2015-09-09 Thread Roman Tkachenko
Hey guys,

We've been having issues in the past couple of days with CPU usage / load
average suddenly skyrocketing on some nodes of the cluster, affecting
performance so badly that the majority of requests start timing out. It can
go on for several hours, with CPU spiking through the roof, then coming back
down to normal, and so on. Weirdly, it affects only a subset of nodes and
it's always the same ones. The boxes Cassandra is running on are pretty
beefy, 24 cores, and these CPU spikes go up to >1000%.

What is the best way to debug this kind of issue and find out what
Cassandra is doing during spikes like this? It doesn't seem to be
compaction-related, as sometimes during these spikes "nodetool
compactionstats" says no compactions are running.

Thanks!