Re: constant CMS GC using CPU time

2012-10-26 Thread aaron morton
 How does compaction_throughput relate to memory usage?  
It reduces the rate of memory allocation. 
e.g. Say normally ParNew can keep up with the rate of memory allocation without 
stopping for too long: the rate of promotion is lowish and everything is 
allocated in Eden. If the allocation rate gets higher, ParNew may run more 
frequently and objects may be promoted to tenured that don't really need to be 
there.  

  I assumed that was more for IO tuning.  I noticed that lowering 
 concurrent_compactors to 4 (from default of 8) lowered the memory used during 
 compactions.
Similar thing to above. This may reduce the number of rows held in memory at 
any instant for compaction. 

Only rows less than in_memory_compaction_limit are loaded into memory during 
compaction. So reducing that may reduce the memory usage.
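
For reference, that limit is in_memory_compaction_limit_in_mb in cassandra.yaml; a sketch of 
lowering it (64 is the 1.1 default, if I remember right):

    in_memory_compaction_limit_in_mb: 32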

  Since then I've reduced the TTL to 1 hour and set gc_grace_seconds to 0 so 
 the number of rows and data dropped to a level it can handle.
Cool. Sorry it took so long to get there. 


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 8:08 AM, Bryan Talbot btal...@aeriagames.com wrote:

 On Thu, Oct 25, 2012 at 4:15 AM, aaron morton aa...@thelastpickle.com wrote:
 This sounds very much like my heap is so consumed by (mostly) bloom
 filters that I am in steady state GC thrash.
 
 Yes, I think that was at least part of the issue.
 
 The rough numbers I've used to estimate working set are:
 
 * bloom filter size for 400M rows at 0.00074 fp without java fudge (they are 
 just a big array) 714 MB
 * memtable size 1024 MB 
 * index sampling:
   *  24 bytes + key (16 bytes for UUID) = 32 bytes 
   * 400M / 128 default sampling = 3,125,000
   *  3,125,000 * 32 = 95 MB
   * java fudge X5 or X10 = 475MB to 950MB
 * ignoring row cache and key cache
  
 So the high side number is 2213 to 2,688. High because the fudge is a 
 delicious sticky guess and the memtable space would rarely be full. 
 
 On a 5120 MB heap, with 800MB new you have roughly  4300 MB tenured  (some 
 goes to perm) and 75% of that is 3,225 MB. Not terrible but it depends on the 
working set and how quickly stuff gets tenured, which depends on the 
 workload. 
 
 These values seem reasonable and in line with what I was seeing.  There are 
 other CF and apps sharing this cluster but this one was the largest.  
 
 
   
 
 You can confirm these guesses somewhat manually by enabling all the GC 
 logging in cassandra-env.sh. Restart the node and let it operate normally, 
 probably best to keep repair off.
 
 
 
 I was using jstat to monitor gc activity and some snippets from that are in 
 my original email in this thread.  The key behavior was that full gc was 
 running pretty often and never able to reclaim much (if any) space.
 
 
  
 
 There are a few things you could try:
 
 * increase the JVM heap by say 1Gb and see how it goes
 * increase bloom filter false positive,  try 0.1 first (see 
 http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance)
  
 * increase index_interval sampling in yaml.  
 * decreasing compaction_throughput and in_memory_compaction_limit can lessen 
 the additional memory pressure compaction adds. 
 * disable caches or ensure off heap caches are used.
 
 I've done several of these already in addition to changing the app to reduce 
 the number of rows retained.  How does compaction_throughput relate to memory 
 usage?  I assumed that was more for IO tuning.  I noticed that lowering 
 concurrent_compactors to 4 (from default of 8) lowered the memory used during 
 compactions.  in_memory_compaction_limit_in_mb seems to only be used for wide 
 rows and this CF didn't have any wider than in_memory_compaction_limit_in_mb. 
  My multithreaded_compaction is still false.
 
  
 
 Watching the gc logs and the cassandra log is a great way to get a feel for 
 what works in your situation. Also take note of any scheduled processing your 
 app does which may impact things, and look for poorly performing queries. 
 
 Finally this book is a good reference on Java GC http://amzn.com/0137142528 
 
 For my understanding what was the average row size for the 400 million keys ? 
 
 
 
 The compacted row mean size for the CF is 8815 (as reported by cfstats) but 
 that comes out to be much larger than the real load per node I was seeing.  
 Each node had about 200GB of data for the CF with 4 nodes in the cluster and 
 RF=3.  At the time, the TTL for all columns was 3 days and gc_grace_seconds 
 was 5 days.  Since then I've reduced the TTL to 1 hour and set 
 gc_grace_seconds to 0 so the number of rows and data dropped to a level it 
 can handle.
 
 
 -Bryan
 



Re: constant CMS GC using CPU time

2012-10-25 Thread aaron morton
 Regarding memory usage after a repair ... Are the merkle trees kept around?
 

They should not be.

Cheers


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/10/2012, at 4:51 PM, B. Todd Burruss bto...@gmail.com wrote:

 Regarding memory usage after a repair ... Are the merkle trees kept around?
 
 On Oct 23, 2012 3:00 PM, Bryan Talbot btal...@aeriagames.com wrote:
 On Mon, Oct 22, 2012 at 6:05 PM, aaron morton aa...@thelastpickle.com wrote:
 The GC was on-going even when the nodes were not compacting or running a 
 heavy application load -- even when the main app was paused, the constant GC 
 continued.
 If you restart a node is the onset of GC activity correlated to some event?
 
 Yes and no.  When the nodes were generally under the .75 occupancy threshold 
 a weekly repair -pr job would cause them to go over the threshold and then 
 stay there even after the repair had completed and there were no ongoing 
 compactions.  It acts as though at least some substantial amount of memory 
 used during repair was never dereferenced once the repair was complete.
 
 Once one CF in particular grew larger the constant GC would start up pretty 
 soon (less than 90 minutes) after a node restart even without a repair.
 
 
  
  
 As a test we dropped the largest CF and the memory usage immediately dropped 
 to acceptable levels and the constant GC stopped.  So it's definitely 
 related to data load.  memtable size is 1 GB, row cache is disabled and key 
 cache is small (default).
 How many keys did the CF have per node? 
 I dismissed the memory used to  hold bloom filters and index sampling. That 
 memory is not considered part of the memtable size, and will end up in the 
 tenured heap. It is generally only a problem with very large key counts per 
 node. 
 
 
 I've changed the app to retain less data for that CF but I think that it was 
 about 400M rows per node.  Row keys are a TimeUUID.  All of the rows are 
 write-once, never updated, and rarely read.  There are no secondary indexes 
 for this particular CF.
 
 
  
  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like 
 the default bloom_filter_fp_chance defaults to 0.0 
 The default should be 0.000744.
 
 If the chance is zero or null this code should run when a new SSTable is 
 written 
   // paranoia -- we've had bugs in the thrift - avro - CfDef dance 
 before, let's not let that break things
 logger.error("Bloom filter FP chance of zero isn't supposed 
 to happen");
 
 Were the CF's migrated from an old version ?
 
 
 Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to 
 1.1.5 with an upgradesstables run at each upgrade along the way.
 
 I could not find a way to view the current bloom_filter_fp_chance settings 
 when they are at a default value.  JMX reports the actual fp rate and if a 
 specific rate is set for a CF that shows up in describe table but I 
 couldn't find out how to tell what the default was.  I didn't inspect the 
 source.
 
  
 Is there any way to predict how much memory the bloom filters will consume 
 if the size of the row keys, number of rows is known, and fp chance is known?
 
 See o.a.c.utils.BloomFilter.getFilter() in the code 
 This http://hur.st/bloomfilter appears to give similar results. 
 
 
 
 
 Ahh, very helpful.  This indicates that 714MB would be used for the bloom 
 filter for that one CF.
 
 JMX / cfstats reports Bloom Filter Space Used but the MBean method name 
 (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If on-disk 
 and in-memory space used is similar then summing up all the Bloom Filter 
 Space Used says they're currently consuming 1-2 GB of the heap which is 
 substantial.
 
 If a CF is rarely read is it safe to set bloom_filter_fp_chance to 1.0?  It 
 just means more trips to SSTable indexes for a read correct?  Trade RAM for 
 time (disk I/O).
 
 -Bryan
 



Re: constant CMS GC using CPU time

2012-10-25 Thread aaron morton
 This sounds very much like my heap is so consumed by (mostly) bloom
 filters that I am in steady state GC thrash.
 
 Yes, I think that was at least part of the issue.

The rough numbers I've used to estimate working set are:

* bloom filter size for 400M rows at 0.00074 fp without java fudge (they are 
just a big array) 714 MB
* memtable size 1024 MB 
* index sampling:
*  24 bytes + key (16 bytes for UUID) = 32 bytes 
* 400M / 128 default sampling = 3,125,000
*  3,125,000 * 32 = 95 MB
* java fudge X5 or X10 = 475MB to 950MB
* ignoring row cache and key cache
 
So the high side number is 2213 to 2,688. High because the fudge is a delicious 
sticky guess and the memtable space would rarely be full. 

On a 5120 MB heap, with 800MB new you have roughly  4300 MB tenured  (some goes 
to perm) and 75% of that is 3,225 MB. Not terrible but it depends on the 
working set and how quickly stuff gets tenured, which depends on the workload. 
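
If it helps, here is the same arithmetic as a throwaway snippet (the standard bloom filter 
sizing formula plus the per-entry guesses above; none of it is lifted from the Cassandra source):

    // Back-of-envelope working set estimate; all sizes are the guesses from this thread.
    public class WorkingSetEstimate {
        public static void main(String[] args) {
            long rows = 400000000L;          // ~400M keys per node
            double fpChance = 0.000744;      // default bloom_filter_fp_chance

            // standard bloom filter sizing: bits = -n * ln(p) / (ln 2)^2
            double bloomBits = -rows * Math.log(fpChance) / (Math.log(2) * Math.log(2));
            double bloomMB = bloomBits / 8 / 1024 / 1024;            // ~714 MB

            // index sampling: one entry per index_interval (default 128) keys,
            // guessed at 32 bytes per sample as above
            double indexMB = (rows / 128.0) * 32 / 1024 / 1024;      // ~95 MB

            double memtableMB = 1024;                                // memtable_total_space_in_mb

            double low  = bloomMB + indexMB * 5 + memtableMB;        // roughly the 2213 MB above
            double high = bloomMB + indexMB * 10 + memtableMB;       // roughly the 2,688 MB above
            System.out.printf("bloom=%.0f index=%.0f total=%.0f to %.0f MB%n",
                              bloomMB, indexMB, low, high);
        }
    }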

You can confirm these guesses somewhat manually by enabling all the GC logging 
in cassandra-env.sh. Restart the node and let it operate normally, probably 
best to keep repair off.
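
From memory, the GC logging block in 1.1's cassandra-env.sh looks roughly like this once 
uncommented (the log path is only an example):

    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"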

This is a sample (partial) GC log from before CMS kicks in; note the concurrent 
mark-sweep used size. Your values may differ, this has non-default settings:

Heap after GC invocations=9947 (full 182):
 par new generation   total 1024000K, used 101882K [0x0006fae0, 
0x000745e0, 0x000745e0)
  eden space 819200K,   0% used [0x0006fae0, 0x0006fae0, 
0x00072ce0)
  from space 204800K,  49% used [0x00073960, 0x00073f97eaf8, 
0x000745e0)
  to   space 204800K,   0% used [0x00072ce0, 0x00072ce0, 
0x00073960)
 concurrent mark-sweep generation total 2965504K, used 2309885K 
[0x000745e0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x0007fae0, 
0x0007fd329000, 0x0008)
}

This is when it starts; see the (full X) count increase:

2012-10-25T03:32:44.664-0500: 76691.891: [GC [1 CMS-initial-mark: 
2309885K(2965504K)] 2411929K(3989504K), 0.0047910 secs] [Times: user=0.01 
sys=0.00, real=0.01 secs] 
Total time for which application threads were stopped: 0.0059850 seconds

(other CMS type logs)

{Heap before GC invocations=9947 (full 183):
 par new generation   total 1024000K, used 921082K [0x0006fae0, 
0x000745e0, 0x000745e0)
  eden space 819200K, 100% used [0x0006fae0, 0x00072ce0, 
0x00072ce0)
  from space 204800K,  49% used [0x00073960, 0x00073f97eaf8, 
0x000745e0)
  to   space 204800K,   0% used [0x00072ce0, 0x00072ce0, 
0x00073960)
 concurrent mark-sweep generation total 2965504K, used 2206292K 
[0x000745e0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x0007fae0, 
0x0007fd329000, 0x0008)

A couple of log messages later the concurrent mark-sweep used size is down:

{Heap before GC invocations=9948 (full 183):
 par new generation   total 1024000K, used 938695K [0x0006fae0, 
0x000745e0, 0x000745e0)
  eden space 819200K, 100% used [0x0006fae0, 0x00072ce0, 
0x00072ce0)
  from space 204800K,  58% used [0x00072ce0, 0x0007342b1f00, 
0x00073960)
  to   space 204800K,   0% used [0x00073960, 0x00073960, 
0x000745e0)
 concurrent mark-sweep generation total 2965504K, used 1096146K 
[0x000745e0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x0007fae0, 
0x0007fd329000, 0x0008)
2012-10-25T03:32:50.479-0500: 76697.706: [GC Before GC:


There are a few things you could try:

* increase the JVM heap by say 1Gb and see how it goes
* increase bloom filter false positive,  try 0.1 first (see 
http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance)
 
* increase index_interval sampling in yaml.  
* decreasing compaction_throughput and in_memory_compaction_limit can lessen 
the additional memory pressure compaction adds (see the settings sketch below). 
* disable caches or ensure off heap caches are used.
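
For reference, a sketch of where those knobs live (the values are only examples and the 
column family name is made up):

    # cassandra.yaml
    index_interval: 256                    # default is 128; higher means fewer samples held in the heap
    compaction_throughput_mb_per_sec: 8    # default is 16
    in_memory_compaction_limit_in_mb: 32   # default is 64
    concurrent_compactors: 4

    # per column family, via cassandra-cli
    update column family my_cf with bloom_filter_fp_chance = 0.1;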

Watching the gc logs and the cassandra log is a great way to get a feel for 
what works in your situation. Also take note of any scheduled processing your 
app does which may impact things, and look for poorly performing queries. 

Finally this book is a good reference on Java GC http://amzn.com/0137142528 

For my understanding what was the average row size for the 400 million keys ? 

Hope that helps. 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 1:22 PM, Bryan Talbot btal...@aeriagames.com wrote:

 On Wed, Oct 24, 2012 at 2:38 PM, Rob Coli rc...@palominodb.com wrote:
 On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot 

Re: constant CMS GC using CPU time

2012-10-25 Thread Bryan Talbot
On Thu, Oct 25, 2012 at 4:15 AM, aaron morton aa...@thelastpickle.com wrote:

  This sounds very much like my heap is so consumed by (mostly) bloom
 filters that I am in steady state GC thrash.


 Yes, I think that was at least part of the issue.


 The rough numbers I've used to estimate working set are:

 * bloom filter size for 400M rows at 0.00074 fp without java fudge (they
 are just a big array) 714 MB
 * memtable size 1024 MB
 * index sampling:
 *  24 bytes + key (16 bytes for UUID) = 32 bytes
  * 400M / 128 default sampling = 3,125,000
 *  3,125,000 * 32 = 95 MB
  * java fudge X5 or X10 = 475MB to 950MB
 * ignoring row cache and key cache

 So the high side number is 2213 to 2,688. High because the fudge is a
 delicious sticky guess and the memtable space would rarely be full.

 On a 5120 MB heap, with 800MB new you have roughly  4300 MB tenured  (some
 goes to perm) and 75% of that is 3,225 MB. Not terrible but it depends on
  the working set and how quickly stuff gets tenured, which depends on the
 workload.


These values seem reasonable and in line with what I was seeing.  There are
other CF and apps sharing this cluster but this one was the largest.





 You can confirm these guesses somewhat manually by enabling all the GC
 logging in cassandra-env.sh. Restart the node and let it operate normally,
 probably best to keep repair off.



I was using jstat to monitor gc activity and some snippets from that are in
my original email in this thread.  The key behavior was that full gc was
running pretty often and never able to reclaim much (if any) space.





 There are a few things you could try:

 * increase the JVM heap by say 1Gb and see how it goes
 * increase bloom filter false positive,  try 0.1 first (see
 http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance
 )
 * increase index_interval sampling in yaml.
 * decreasing compaction_throughput and in_memory_compaction_limit can
  lessen the additional memory pressure compaction adds.
 * disable caches or ensure off heap caches are used.


I've done several of these already in addition to changing the app to
reduce the number of rows retained.  How does compaction_throughput relate
to memory usage?  I assumed that was more for IO tuning.  I noticed that
lowering concurrent_compactors to 4 (from default of 8) lowered the memory
used during compactions.  in_memory_compaction_limit_in_mb seems to only be
used for wide rows and this CF didn't have any wider
than in_memory_compaction_limit_in_mb.  My multithreaded_compaction is
still false.




 Watching the gc logs and the cassandra log is a great way to get a feel
 for what works in your situation. Also take note of any scheduled
 processing your app does which may impact things, and look for poorly
 performing queries.

 Finally this book is a good reference on Java GC
 http://amzn.com/0137142528

 For my understanding what was the average row size for the 400 million
 keys ?



The compacted row mean size for the CF is 8815 (as reported by cfstats) but
that comes out to be much larger than the real load per node I was seeing.
 Each node had about 200GB of data for the CF with 4 nodes in the cluster
and RF=3.  At the time, the TTL for all columns was 3 days and
gc_grace_seconds was 5 days.  Since then I've reduced the TTL to 1 hour and
set gc_grace_seconds to 0 so the number of rows and data dropped to a level
it can handle.


-Bryan


Re: constant CMS GC using CPU time

2012-10-24 Thread Rob Coli
On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot btal...@aeriagames.com wrote:
 The nodes with the most data used the most memory.  All nodes are affected
 eventually not just one.  The GC was on-going even when the nodes were not
 compacting or running a heavy application load -- even when the main app was
 paused, the constant GC continued.

This sounds very much like my heap is so consumed by (mostly) bloom
filters that I am in steady state GC thrash.

Do you have heap graphs which show a healthy sawtooth GC cycle which
then more or less flatlines?

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: constant CMS GC using CPU time

2012-10-24 Thread Bryan Talbot
On Wed, Oct 24, 2012 at 2:38 PM, Rob Coli rc...@palominodb.com wrote:

 On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot btal...@aeriagames.com
 wrote:
  The nodes with the most data used the most memory.  All nodes are
 affected
  eventually not just one.  The GC was on-going even when the nodes were
 not
  compacting or running a heavy application load -- even when the main app
 was
  paused, the constant GC continued.

 This sounds very much like my heap is so consumed by (mostly) bloom
 filters that I am in steady state GC thrash.


Yes, I think that was at least part of the issue.




 Do you have heap graphs which show a healthy sawtooth GC cycle which
 then more or less flatlines?



I didn't save any graphs but that is what they would look like.  I was
using jstat to monitor gc activity.

-Bryan


Re: constant CMS GC using CPU time

2012-10-23 Thread Bryan Talbot
These GC settings are the default (recommended?) settings from
cassandra-env.  I added the UseCompressedOops.

-Bryan


On Mon, Oct 22, 2012 at 6:15 PM, Will @ SOHO w...@voodoolunchbox.com wrote:

  On 10/22/2012 09:05 PM, aaron morton wrote:

  # GC tuning options
 JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
 JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
 JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
 JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
 JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
 JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
 JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
 JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

  You are too far behind the reference JVMs. Parallel GC is the preferred
 and highest performing form in the current Security Baseline version of the
 JVMs.




-- 
Bryan Talbot
Architect / Platform team lead, Aeria Games and Entertainment
Silicon Valley | Berlin | Tokyo | Sao Paulo


Re: constant CMS GC using CPU time

2012-10-23 Thread Bryan Talbot
On Mon, Oct 22, 2012 at 6:05 PM, aaron morton aa...@thelastpickle.com wrote:

 The GC was on-going even when the nodes were not compacting or running a
 heavy application load -- even when the main app was paused, the constant GC
 continued.

 If you restart a node is the onset of GC activity correlated to some event?


Yes and no.  When the nodes were generally under the
.75 occupancy threshold a weekly repair -pr job would cause them to go
over the threshold and then stay there even after the repair had completed
and there were no ongoing compactions.  It acts as though at least some
substantial amount of memory used during repair was never dereferenced once
the repair was complete.

Once one CF in particular grew larger the constant GC would start up pretty
soon (less than 90 minutes) after a node restart even without a repair.






 As a test we dropped the largest CF and the memory
 usage immediately dropped to acceptable levels and the constant GC stopped.
  So it's definitely related to data load.  memtable size is 1 GB, row cache
 is disabled and key cache is small (default).

 How many keys did the CF have per node?
 I dismissed the memory used to  hold bloom filters and index sampling.
 That memory is not considered part of the memtable size, and will end up in
 the tenured heap. It is generally only a problem with very large key counts
 per node.


I've changed the app to retain less data for that CF but I think that it
was about 400M rows per node.  Row keys are a TimeUUID.  All of the rows
are write-once, never updated, and rarely read.  There are no secondary
indexes for this particular CF.




  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like
 the default bloom_filter_fp_chance defaults to 0.0

 The default should be 0.000744.

 If the chance is zero or null this code should run when a new SSTable is
 written
   // paranoia -- we've had bugs in the thrift - avro - CfDef dance
 before, let's not let that break things
  logger.error("Bloom filter FP chance of zero isn't
  supposed to happen");

 Were the CF's migrated from an old version ?


Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to
1.1.5 with an upgradesstables run at each upgrade along the way.

I could not find a way to view the current bloom_filter_fp_chance settings
when they are at a default value.  JMX reports the actual fp rate and if a
specific rate is set for a CF that shows up in describe table but I
couldn't find out how to tell what the default was.  I didn't inspect the
source.



 Is there any way to predict how much memory the bloom filters will consume
 if the size of the row keys, number of rows is known, and fp chance is
 known?


 See o.a.c.utils.BloomFilter.getFilter() in the code
 This http://hur.st/bloomfilter appears to give similar results.




Ahh, very helpful.  This indicates that 714MB would be used for the bloom
filter for that one CF.

JMX / cfstats reports Bloom Filter Space Used but the MBean method name
(getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If
on-disk and in-memory space used is similar then summing up all the Bloom
Filter Space Used says they're currently consuming 1-2 GB of the heap
which is substantial.

If a CF is rarely read is it safe to set bloom_filter_fp_chance to 1.0?  It
just means more trips to SSTable indexes for a read correct?  Trade RAM for
time (disk I/O).

-Bryan


Re: constant CMS GC using CPU time

2012-10-23 Thread B. Todd Burruss
Regarding memory usage after a repair ... Are the merkle trees kept around?
On Oct 23, 2012 3:00 PM, Bryan Talbot btal...@aeriagames.com wrote:

 On Mon, Oct 22, 2012 at 6:05 PM, aaron morton aa...@thelastpickle.com wrote:

 The GC was on-going even when the nodes were not compacting or running a
 heavy application load -- even when the main app was paused, the constant GC
 continued.

 If you restart a node is the onset of GC activity correlated to some
 event?


 Yes and no.  When the nodes were generally under the
 .75 occupancy threshold a weekly repair -pr job would cause them to go
 over the threshold and then stay there even after the repair had completed
 and there were no ongoing compactions.  It acts as though at least some
 substantial amount of memory used during repair was never dereferenced once
 the repair was complete.

 Once one CF in particular grew larger the constant GC would start up
 pretty soon (less than 90 minutes) after a node restart even without a
 repair.






 As a test we dropped the largest CF and the memory
 usage immediately dropped to acceptable levels and the constant GC stopped.
  So it's definitely related to data load.  memtable size is 1 GB, row cache
 is disabled and key cache is small (default).

 How many keys did the CF have per node?
 I dismissed the memory used to  hold bloom filters and index sampling.
 That memory is not considered part of the memtable size, and will end up in
 the tenured heap. It is generally only a problem with very large key counts
 per node.


 I've changed the app to retain less data for that CF but I think that it
 was about 400M rows per node.  Row keys are a TimeUUID.  All of the rows
 are write-once, never updated, and rarely read.  There are no secondary
 indexes for this particular CF.




  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like
 the default bloom_filter_fp_chance defaults to 0.0

 The default should be 0.000744.

 If the chance is zero or null this code should run when a new SSTable is
 written
   // paranoia -- we've had bugs in the thrift - avro - CfDef dance
 before, let's not let that break things
  logger.error("Bloom filter FP chance of zero isn't
  supposed to happen");

 Were the CF's migrated from an old version ?


 Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to
 1.1.5 with an upgradesstables run at each upgrade along the way.

 I could not find a way to view the current bloom_filter_fp_chance settings
 when they are at a default value.  JMX reports the actual fp rate and if a
 specific rate is set for a CF that shows up in describe table but I
 couldn't find out how to tell what the default was.  I didn't inspect the
 source.



 Is there any way to predict how much memory the bloom filters will
 consume if the size of the row keys, number of rows is known, and fp chance
 is known?


 See o.a.c.utils.BloomFilter.getFilter() in the code
 This http://hur.st/bloomfilter appears to give similar results.




 Ahh, very helpful.  This indicates that 714MB would be used for the bloom
 filter for that one CF.

 JMX / cfstats reports Bloom Filter Space Used but the MBean method name
 (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If
 on-disk and in-memory space used is similar then summing up all the Bloom
 Filter Space Used says they're currently consuming 1-2 GB of the heap
 which is substantial.

 If a CF is rarely read is it safe to set bloom_filter_fp_chance to 1.0?
  It just means more trips to SSTable indexes for a read correct?  Trade RAM
 for time (disk I/O).

 -Bryan




Re: constant CMS GC using CPU time

2012-10-22 Thread aaron morton
If you are using the default settings I would try to correlate the GC activity 
with some application activity before tweaking.

If this is happening on one machine out of 4 ensure that client load is 
distributed evenly. 

See if the rise in GC activity is related to compaction, repair or an increase 
in throughput. OpsCentre or some other monitoring can help with the last one. 
Your mention of TTL makes me think compaction may be doing a bit of work 
churning through rows. 
  
Some things I've done in the past before looking at heap settings:
* reduce compaction_throughput to reduce the memory churn
* reduce in_memory_compaction_limit 
* if needed reduce concurrent_compactors

 Currently it seems like the memory used scales with the amount of bytes 
 stored and not with how busy the server actually is.  That's not such a good 
 thing.
The memtable_total_space_in_mb setting in the yaml tells C* how much memory to devote to 
the memtables. That, together with the global row cache setting, determines how much memory 
will be used for storing data, and it will not increase in line with the 
static data load.
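
For example, in cassandra.yaml (1024 here just mirrors the 1 GB memtable figure from this thread):

    memtable_total_space_in_mb: 1024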

Nowadays GC issues are typically due to more dynamic forces, like compaction, 
repair and throughput. 
 
Hope that helps. 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 20/10/2012, at 6:59 AM, Bryan Talbot btal...@aeriagames.com wrote:

 ok, let me try asking the question a different way ...
 
 How does cassandra use memory and how can I plan how much is needed?  I have 
 a 1 GB memtable and 5 GB total heap and that's still not enough even though 
 the number of concurrent connections and garbage generation rate is fairly 
 low.
 
 If I were using mysql or oracle, I could compute how much memory could be 
 used by N concurrent connections, how much is allocated for caching, temp 
 spaces, etc.  How can I do this for cassandra?  Currently it seems like the 
 memory used scales with the amount of bytes stored and not with how busy the 
 server actually is.  That's not such a good thing.
 
 -Bryan
 
 
 
 On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot btal...@aeriagames.com wrote:
 In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11 
 (64-bit), the nodes are often getting stuck in state where CMS collections 
 of the old space are constantly running.  
 
 The JVM configuration is using the standard settings in cassandra-env -- 
 relevant settings are included below.  The max heap is currently set to 5 GB 
 with 800MB for new size.  I don't believe that the cluster is overly busy and 
 seems to be performing well enough other than this issue.  When nodes get 
 into this state they never seem to leave it (by freeing up old space memory) 
 without restarting cassandra.  They typically enter this state while running 
 nodetool repair -pr but once they start doing this, restarting them only 
 fixes it for a couple of hours.
 
 Compactions are completing and are generally not queued up.  All CF are using 
 STCS.  The busiest CF consumes about 100GB of space on disk, is write heavy, 
 and all columns have a TTL of 3 days.  Overall, there are 41 CF including 
 those used for system keyspace and secondary indexes.  The number of SSTables 
 per node currently varies from 185-212.
 
 Other than frequent log warnings about GCInspector  - Heap is 0.xxx full... 
 and StorageService  - Flushing CFS(...) to relieve memory pressure there 
 are no other log entries to indicate there is a problem.
 
 Does the memory needed vary depending on the amount of data stored?  If so, 
 how can I predict how much jvm space is needed?  I don't want to make the 
 heap too large as that's bad too.  Maybe there's a memory leak related to 
 compaction that doesn't allow meta-data to be purged?
 
 
 -Bryan
 
 
 12 GB of RAM in host with ~6 GB used by java and ~6 GB for OS and buffer 
 cache.
 $ free -m
  total   used   free sharedbuffers cached
 Mem: 12001  11870131  0  4   5778
 -/+ buffers/cache:   6087   5914
 Swap:0  0  0
 
 
 jvm settings in cassandra-env
 MAX_HEAP_SIZE=5G
 HEAP_NEWSIZE=800M
 
 # GC tuning options
 JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
 JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
 JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
 JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
 JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
 JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
 JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
 JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
 
 
 jstat shows about 12 full collections per minute with old heap usage 
 constantly over 75% so CMS is always over the CMSInitiatingOccupancyFraction 
 threshold.
 
 $ jstat -gcutil -t 22917 5000 4
 Timestamp S0 S1 E  O  P YGC YGCTFGC
 FGCT GCT   
132063.0  34.70   0.00  26.03  82.29  59.88  21580  506.887 17523 
 3078.941 3585.829

Re: constant CMS GC using CPU time

2012-10-22 Thread Bryan Talbot
The memory usage was correlated with the size of the data set.  The nodes
were a bit unbalanced which is normal due to variations in compactions.
 The nodes with the most data used the most memory.  All nodes are affected
eventually not just one.  The GC was on-going even when the nodes were not
compacting or running a heavy application load -- even when the main app
was paused, the constant GC continued.

As a test we dropped the largest CF and the memory
usage immediately dropped to acceptable levels and the constant GC stopped.
 So it's definitely related to data load.  memtable size is 1 GB, row cache
is disabled and key cache is small (default).

I believe one culprit turned out to be the bloom filters.  They were 2+ GB
(as reported by nodetool cfstats anyway).  It looks like the default
bloom_filter_fp_chance defaults to 0.0 even though guides recommend 0.10 as
the minimum value.  Raising that to 0.20 for some write-mostly CF reduced
memory used by 1GB or so.

Is there any way to predict how much memory the bloom filters will consume
if the size of the row keys, number of rows is known, and fp chance is
known?

-Bryan



On Mon, Oct 22, 2012 at 12:25 AM, aaron morton aa...@thelastpickle.com wrote:

 If you are using the default settings I would try to correlate the GC
 activity with some application activity before tweaking.

 If this is happening on one machine out of 4 ensure that client load is
 distributed evenly.

 See if the rise in GC activity is related to compaction, repair or an
 increase in throughput. OpsCentre or some other monitoring can help with
 the last one. Your mention of TTL makes me think compaction may be doing a
 bit of work churning through rows.

 Some things I've done in the past before looking at heap settings:
 * reduce compaction_throughput to reduce the memory churn
 * reduce in_memory_compaction_limit
 * if needed reduce concurrent_compactors

 Currently it seems like the memory used scales with the amount of bytes
 stored and not with how busy the server actually is.  That's not such a
 good thing.

 The memtable_total_space_in_mb setting in the yaml tells C* how much memory to devote
 to the memtables. That, together with the global row cache setting, determines how much
 memory will be used for storing data, and it will not increase
 in line with the static data load.

 Nowadays GC issues are typically due to more dynamic forces, like
 compaction, repair and throughput.

 Hope that helps.

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 20/10/2012, at 6:59 AM, Bryan Talbot btal...@aeriagames.com wrote:

 ok, let me try asking the question a different way ...

 How does cassandra use memory and how can I plan how much is needed?  I
 have a 1 GB memtable and 5 GB total heap and that's still not enough even
 though the number of concurrent connections and garbage generation rate is
 fairly low.

 If I were using mysql or oracle, I could compute how much memory could be
 used by N concurrent connections, how much is allocated for caching, temp
 spaces, etc.  How can I do this for cassandra?  Currently it seems like the
 memory used scales with the amount of bytes stored and not with how busy
 the server actually is.  That's not such a good thing.

 -Bryan



 On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot btal...@aeriagames.com wrote:

 In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11
 (64-bit), the nodes are often getting stuck in state where CMS
 collections of the old space are constantly running.

 The JVM configuration is using the standard settings in cassandra-env --
 relevant settings are included below.  The max heap is currently set to 5
 GB with 800MB for new size.  I don't believe that the cluster is overly
 busy and seems to be performing well enough other than this issue.  When
 nodes get into this state they never seem to leave it (by freeing up old
 space memory) without restarting cassandra.  They typically enter this
 state while running nodetool repair -pr but once they start doing this,
 restarting them only fixes it for a couple of hours.

 Compactions are completing and are generally not queued up.  All CF are
 using STCS.  The busiest CF consumes about 100GB of space on disk, is write
 heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF
 including those used for system keyspace and secondary indexes.  The number
 of SSTables per node currently varies from 185-212.

 Other than frequent log warnings about GCInspector  - Heap is 0.xxx
 full... and StorageService  - Flushing CFS(...) to relieve memory
 pressure there are no other log entries to indicate there is a problem.

 Does the memory needed vary depending on the amount of data stored?  If
 so, how can I predict how much jvm space is needed?  I don't want to make
 the heap too large as that's bad too.  Maybe there's a memory leak related
 to compaction that doesn't allow meta-data to be purged?


 -Bryan


 12 GB of RAM 

Re: constant CMS GC using CPU time

2012-10-22 Thread aaron morton
 The GC was on-going even when the nodes were not compacting or running a 
 heavy application load -- even when the main app was paused, the constant GC 
 continued.
If you restart a node is the onset of GC activity correlated to some event?
 
 As a test we dropped the largest CF and the memory usage immediately dropped 
 to acceptable levels and the constant GC stopped.  So it's definitely related 
 to data load.  memtable size is 1 GB, row cache is disabled and key cache is 
 small (default).
How many keys did the CF have per node? 
I dismissed the memory used to  hold bloom filters and index sampling. That 
memory is not considered part of the memtable size, and will end up in the 
tenured heap. It is generally only a problem with very large key counts per 
node. 

  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like the 
 default bloom_filter_fp_chance defaults to 0.0 
The default should be 0.000744.

If the chance is zero or null this code should run when a new SSTable is 
written 
  // paranoia -- we've had bugs in the thrift - avro - CfDef dance before, 
let's not let that break things
logger.error("Bloom filter FP chance of zero isn't supposed to 
happen");

Were the CF's migrated from an old version ?

 Is there any way to predict how much memory the bloom filters will consume if 
 the size of the row keys, number of rows is known, and fp chance is known?

See o.a.c.utils.BloomFilter.getFilter() in the code 
This http://hur.st/bloomfilter appears to give similar results. 
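
As a rough closed-form check (the standard bloom filter formula, not the exact Cassandra 
implementation):

    bits = -n * ln(p) / (ln 2)^2

For n = 400M keys at the default p = 0.000744 that is about 6.0e9 bits, i.e. roughly 715 MB; 
at p = 0.2 the same 400M keys would need only about 160 MB.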

Cheers
 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/10/2012, at 4:38 AM, Bryan Talbot btal...@aeriagames.com wrote:

 The memory usage was correlated with the size of the data set.  The nodes 
 were a bit unbalanced which is normal due to variations in compactions.  The 
 nodes with the most data used the most memory.  All nodes are affected 
 eventually not just one.  The GC was on-going even when the nodes were not 
 compacting or running a heavy application load -- even when the main app was 
 paused, the constant GC continued.
 
 As a test we dropped the largest CF and the memory usage immediately dropped 
 to acceptable levels and the constant GC stopped.  So it's definitely related 
 to data load.  memtable size is 1 GB, row cache is disabled and key cache is 
 small (default).
 
 I believe one culprit turned out to be the bloom filters.  They were 2+ GB 
 (as reported by nodetool cfstats anyway).  It looks like the default 
 bloom_filter_fp_chance defaults to 0.0 even though guides recommend 0.10 as 
 the minimum value.  Raising that to 0.20 for some write-mostly CF reduced 
 memory used by 1GB or so.
 
 Is there any way to predict how much memory the bloom filters will consume if 
 the size of the row keys, number of rows is known, and fp chance is known?
 
 -Bryan
 
 
 
 On Mon, Oct 22, 2012 at 12:25 AM, aaron morton aa...@thelastpickle.com 
 wrote:
 If you are using the default settings I would try to correlate the GC 
 activity with some application activity before tweaking.
 
 If this is happening on one machine out of 4 ensure that client load is 
 distributed evenly. 
 
 See if the rise in GC activity is related to compaction, repair or an 
 increase in throughput. OpsCentre or some other monitoring can help with the 
 last one. Your mention of TTL makes me think compaction may be doing a bit of 
 work churning through rows. 
   
 Some things I've done in the past before looking at heap settings:
 * reduce compaction_throughput to reduce the memory churn
 * reduce in_memory_compaction_limit 
 * if needed reduce concurrent_compactors
 
 Currently it seems like the memory used scales with the amount of bytes 
 stored and not with how busy the server actually is.  That's not such a good 
 thing.
 The memtable_total_space_in_mb setting in the yaml tells C* how much memory to devote to 
 the memtables. That, together with the global row cache setting, determines how much memory 
 will be used for storing data, and it will not increase in line 
 with the static data load.
 
 Nowadays GC issues are typically due to more dynamic forces, like compaction, 
 repair and throughput. 
  
 Hope that helps. 
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 20/10/2012, at 6:59 AM, Bryan Talbot btal...@aeriagames.com wrote:
 
 ok, let me try asking the question a different way ...
 
 How does cassandra use memory and how can I plan how much is needed?  I have 
 a 1 GB memtable and 5 GB total heap and that's still not enough even though 
 the number of concurrent connections and garbage generation rate is fairly 
 low.
 
 If I were using mysql or oracle, I could compute how much memory could be 
 used by N concurrent connections, how much is allocated for caching, temp 
 spaces, etc.  How can I do this for cassandra?  Currently it seems like the 
 memory used scales with the amount of bytes stored and not 

Re: constant CMS GC using CPU time

2012-10-22 Thread Will @ SOHO

On 10/22/2012 09:05 PM, aaron morton wrote:

# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

You are too far behind the reference JVMs. Parallel GC is the preferred 
and highest performing form in the current Security Baseline version of 
the JVMs.


Re: constant CMS GC using CPU time

2012-10-21 Thread Radim Kolar

On 18.10.2012 20:06, Bryan Talbot wrote:
In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11 
(64-bit), the nodes are often getting stuck in state where CMS 
collections of the old space are constantly running.

you need more java heap memory


Re: constant CMS GC using CPU time

2012-10-19 Thread Bryan Talbot
ok, let me try asking the question a different way ...

How does cassandra use memory and how can I plan how much is needed?  I
have a 1 GB memtable and 5 GB total heap and that's still not enough even
though the number of concurrent connections and garbage generation rate is
fairly low.

If I were using mysql or oracle, I could compute how much memory could be
used by N concurrent connections, how much is allocated for caching, temp
spaces, etc.  How can I do this for cassandra?  Currently it seems like the
memory used scales with the amount of bytes stored and not with how busy
the server actually is.  That's not such a good thing.

-Bryan



On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot btal...@aeriagames.com wrote:

 In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11
 (64-bit), the nodes are often getting stuck in state where CMS
 collections of the old space are constantly running.

 The JVM configuration is using the standard settings in cassandra-env --
 relevant settings are included below.  The max heap is currently set to 5
 GB with 800MB for new size.  I don't believe that the cluster is overly
 busy and seems to be performing well enough other than this issue.  When
 nodes get into this state they never seem to leave it (by freeing up old
 space memory) without restarting cassandra.  They typically enter this
 state while running nodetool repair -pr but once they start doing this,
 restarting them only fixes it for a couple of hours.

 Compactions are completing and are generally not queued up.  All CF are
 using STCS.  The busiest CF consumes about 100GB of space on disk, is write
 heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF
 including those used for system keyspace and secondary indexes.  The number
 of SSTables per node currently varies from 185-212.

 Other than frequent log warnings about GCInspector  - Heap is 0.xxx
 full... and StorageService  - Flushing CFS(...) to relieve memory
 pressure there are no other log entries to indicate there is a problem.

 Does the memory needed vary depending on the amount of data stored?  If
 so, how can I predict how much jvm space is needed?  I don't want to make
 the heap too large as that's bad too.  Maybe there's a memory leak related
 to compaction that doesn't allow meta-data to be purged?


 -Bryan


 12 GB of RAM in host with ~6 GB used by java and ~6 GB for OS and buffer
 cache.
 $ free -m
  total   used   free sharedbuffers cached
 Mem: 12001  11870131  0  4   5778
 -/+ buffers/cache:   6087   5914
 Swap:0  0  0


 jvm settings in cassandra-env
 MAX_HEAP_SIZE=5G
 HEAP_NEWSIZE=800M

 # GC tuning options
 JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
 JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
 JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
 JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
 JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
 JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
 JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
 JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"


 jstat shows about 12 full collections per minute with old heap usage
 constantly over 75% so CMS is always over the
 CMSInitiatingOccupancyFraction threshold.
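
 Back of the envelope: with a 5 GB heap and 800 MB new gen the old gen is roughly 4300 MB, and
 82.29% of that is about 3540 MB live -- above the 75% trigger of about 3225 MB -- so CMS starts
 again more or less as soon as it finishes.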

 $ jstat -gcutil -t 22917 5000 4
 Timestamp S0 S1 E  O  P YGC YGCTFGC
  FGCT GCT
132063.0  34.70   0.00  26.03  82.29  59.88  21580  506.887 17523
 3078.941 3585.829
132068.0  34.70   0.00  50.02  81.23  59.88  21580  506.887 17524
 3079.220 3586.107
132073.1   0.00  24.92  46.87  81.41  59.88  21581  506.932 17525
 3079.583 3586.515
132078.1   0.00  24.92  64.71  81.40  59.88  21581  506.932 17527
 3079.853 3586.785


 Other hosts not currently experiencing the high CPU load have a heap less
 than .75 full.

 $ jstat -gcutil -t 6063 5000 4
 Timestamp S0 S1 E  O  P YGC YGCTFGC
  FGCT GCT
520731.6   0.00  12.70  36.37  71.33  59.26  46453 1688.809 14785
 2130.779 3819.588
520736.5   0.00  12.70  53.25  71.33  59.26  46453 1688.809 14785
 2130.779 3819.588
520741.5   0.00  12.70  68.92  71.33  59.26  46453 1688.809 14785
 2130.779 3819.588
520746.5   0.00  12.70  83.11  71.33  59.26  46453 1688.809 14785
 2130.779 3819.588