Re: constant CMS GC using CPU time
How does compaction_throughput relate to memory usage? It reduces the rate of memory allocation. e.g. say normally ParNew can keep up with the rate of memory usage without stopping for too long: the rate of promotion is low'ish and everything is allocated in Eden. If the allocation rate gets higher, ParNew runs may become more frequent and objects may be promoted to tenured that don't really need to be there.

I assumed that was more for IO tuning. I noticed that lowering concurrent_compactors to 4 (from default of 8) lowered the memory used during compactions. Similar thing to above. This may reduce the number of rows held in memory at any instant for compaction. Only rows less than in_memory_compaction_limit are loaded into memory during compaction, so reducing that may reduce the memory usage.

Since then I've reduced the TTL to 1 hour and set gc_grace_seconds to 0 so the number of rows and data dropped to a level it can handle. Cool. Sorry it took so long to get there. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 26/10/2012, at 8:08 AM, Bryan Talbot btal...@aeriagames.com wrote: On Thu, Oct 25, 2012 at 4:15 AM, aaron morton aa...@thelastpickle.com wrote: This sounds very much like my heap is so consumed by (mostly) bloom filters that I am in steady state GC thrash. Yes, I think that was at least part of the issue. The rough numbers I've used to estimate working set are:
* bloom filter size for 400M rows at 0.00074 fp without java fudge (they are just a big array) 714 MB
* memtable size 1024 MB
* index sampling: 24 bytes + key (16 bytes for UUID) = 32 bytes; 400M / 128 default sampling = 3,125,000 entries; 3,125,000 * 32 = 95 MB
* java fudge X5 or X10 = 475 MB to 950 MB
* ignoring row cache and key cache
So the high side number is 2,213 to 2,688 MB. High because the fudge is a delicious sticky guess and the memtable space would rarely be full. 
On a 5120 MB heap, with 800 MB new, you have roughly 4300 MB tenured (some goes to perm) and 75% of that is 3,225 MB. Not terrible, but it depends on the working set and how quickly stuff gets tenured, which depends on the workload.

These values seem reasonable and in line with what I was seeing. There are other CF and apps sharing this cluster but this one was the largest.

You can confirm these guesses somewhat manually by enabling all the GC logging in cassandra-env.sh. Restart the node and let it operate normally; probably best to keep repair off.

I was using jstat to monitor gc activity and some snippets from that are in my original email in this thread. The key behavior was that full gc was running pretty often and never able to reclaim much (if any) space.

There are a few things you could try:
* increase the JVM heap by say 1 GB and see how it goes
* increase bloom filter false positive, try 0.1 first (see http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance)
* increase index_interval sampling in yaml
* decreasing compaction_throughput and in_memory_compaction_limit can lessen the additional memory pressure compaction adds
* disable caches or ensure off heap caches are used

I've done several of these already in addition to changing the app to reduce the number of rows retained. How does compaction_throughput relate to memory usage? I assumed that was more for IO tuning. I noticed that lowering concurrent_compactors to 4 (from default of 8) lowered the memory used during compactions. in_memory_compaction_limit_in_mb seems to only be used for wide rows and this CF didn't have any wider than in_memory_compaction_limit_in_mb. My multithreaded_compaction is still false.

Watching the gc logs and the cassandra log is a great way to get a feel for what works in your situation. Also take note of any scheduled processing your app does which may impact things, and look for poorly performing queries. 
Finally, this book is a good reference on Java GC: http://amzn.com/0137142528 For my understanding, what was the average row size for the 400 million keys? The compacted row mean size for the CF is 8815 bytes (as reported by cfstats) but that comes out to be much larger than the real load per node I was seeing. Each node had about 200GB of data for the CF with 4 nodes in the cluster and RF=3. At the time, the TTL for all columns was 3 days and gc_grace_seconds was 5 days. Since then I've reduced the TTL to 1 hour and set gc_grace_seconds to 0 so the number of rows and data dropped to a level it can handle. -Bryan
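The back-of-envelope working-set estimate quoted in this message can be reproduced with a short script. This is a sketch of the arithmetic only, using the figures from the thread (400M rows, 714 MB bloom filter, 1 GB memtable, index_interval 128, 5120 MB heap, 800 MB new, CMSInitiatingOccupancyFraction=75); the 5x/10x "java fudge" multiplier is, as Aaron says, a guess:

```python
# Back-of-envelope heap working-set estimate from the thread (sketch only).
MB = 1024 * 1024

rows = 400_000_000
bloom_mb = 714                    # bloom filter at fp=0.00074, per the thread
memtable_mb = 1024                # memtable_total_space_in_mb

# Index sampling: 24 bytes overhead + 16-byte TimeUUID key = 32 bytes/entry.
entries = rows // 128             # default index_interval
sampling_mb = entries * 32 / MB   # ~95 MB raw

# "Java fudge": 5x to 10x multiplier on the raw sampling size.
low = bloom_mb + memtable_mb + sampling_mb * 5    # ~2213 MB
high = bloom_mb + memtable_mb + sampling_mb * 10  # ~2688 MB

# Tenured headroom: heap minus new gen, CMS triggers at 75% occupancy.
tenured_mb = 5120 - 800                 # ~4300 MB (some also goes to perm)
cms_threshold_mb = tenured_mb * 0.75    # ~3240 MB before CMS kicks in

print(round(low), round(high), cms_threshold_mb)
```

The high-side estimate sits uncomfortably close to the CMS trigger point, which is consistent with the constant collections described in the thread.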
Re: constant CMS GC using CPU time
Regarding memory usage after a repair ... Are the merkle trees kept around? They should not be. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 24/10/2012, at 4:51 PM, B. Todd Burruss bto...@gmail.com wrote: Regarding memory usage after a repair ... Are the merkle trees kept around? On Oct 23, 2012 3:00 PM, Bryan Talbot btal...@aeriagames.com wrote: ... 
They were 2+ GB (as reported by nodetool cfstats anyway). It looks like the default bloom_filter_fp_chance defaults to 0.0. The default should be 0.000744. If the chance is zero or null this code should run when a new SSTable is written: // paranoia -- we've had bugs in the thrift - avro - CfDef dance before, let's not let that break things logger.error("Bloom filter FP chance of zero isn't supposed to happen"); Were the CFs migrated from an old version? Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to 1.1.5 with an upgradesstables run at each upgrade along the way. I could not find a way to view the current bloom_filter_fp_chance settings when they are at a default value. JMX reports the actual fp rate and if a specific rate is set for a CF that shows up in describe table, but I couldn't find out how to tell what the default was. I didn't inspect the source. Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, number of rows, and fp chance are known? See o.a.c.utils.BloomFilter.getFilter() in the code. This http://hur.st/bloomfilter appears to give similar results. Ahh, very helpful. This indicates that 714MB would be used for the bloom filter for that one CF. JMX / cfstats reports Bloom Filter Space Used but the MBean method name (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If on-disk and in-memory space used is similar then summing up all the Bloom Filter Space Used says they're currently consuming 1-2 GB of the heap, which is substantial. If a CF is rarely read, is it safe to set bloom_filter_fp_chance to 1.0? It just means more trips to SSTable indexes for a read, correct? Trade RAM for time (disk I/O). -Bryan
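The prediction asked about above has a closed form. A sketch using the standard optimal-Bloom-filter formula, which is what http://hur.st/bloomfilter computes; Cassandra's actual filter (o.a.c.utils.BloomFilter.getFilter()) rounds to whole buckets, so real usage will differ slightly:

```python
from math import ceil, log

def bloom_filter_mb(n_keys: int, fp_chance: float) -> float:
    """Optimal Bloom filter size in MB for n_keys at a false-positive rate.

    Standard formula: m = -n * ln(p) / (ln 2)^2 bits.
    """
    bits = ceil(-n_keys * log(fp_chance) / (log(2) ** 2))
    return bits / 8 / (1024 * 1024)

# The CF from this thread: 400M rows at the 0.000744 default fp chance.
print(round(bloom_filter_mb(400_000_000, 0.000744)))  # ~715 MB, matching the ~714 MB figure

# The suggested 0.1 fp chance shrinks it to roughly a third of that.
print(round(bloom_filter_mb(400_000_000, 0.1)))
```

This also answers the memory-vs-reads trade-off directly: a looser fp chance cuts filter memory roughly in proportion to ln(p), at the cost of more index trips.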
Re: constant CMS GC using CPU time
This sounds very much like my heap is so consumed by (mostly) bloom filters that I am in steady state GC thrash. Yes, I think that was at least part of the issue. The rough numbers I've used to estimate working set are:
* bloom filter size for 400M rows at 0.00074 fp without java fudge (they are just a big array) 714 MB
* memtable size 1024 MB
* index sampling: 24 bytes + key (16 bytes for UUID) = 32 bytes; 400M / 128 default sampling = 3,125,000 entries; 3,125,000 * 32 = 95 MB
* java fudge X5 or X10 = 475 MB to 950 MB
* ignoring row cache and key cache
So the high side number is 2,213 to 2,688 MB. High because the fudge is a delicious sticky guess and the memtable space would rarely be full.

On a 5120 MB heap, with 800 MB new, you have roughly 4300 MB tenured (some goes to perm) and 75% of that is 3,225 MB. Not terrible, but it depends on the working set and how quickly stuff gets tenured, which depends on the workload.

You can confirm these guesses somewhat manually by enabling all the GC logging in cassandra-env.sh. Restart the node and let it operate normally; probably best to keep repair off. This is a sample (partial) GC log before CMS kicks in; note the concurrent mark-sweep used size. 
Your values may differ, this has non default settings:

Heap after GC invocations=9947 (full 182):
 par new generation total 1024000K, used 101882K [0x0006fae0, 0x000745e0, 0x000745e0)
  eden space 819200K,   0% used [0x0006fae0, 0x0006fae0, 0x00072ce0)
  from space 204800K,  49% used [0x00073960, 0x00073f97eaf8, 0x000745e0)
  to   space 204800K,   0% used [0x00072ce0, 0x00072ce0, 0x00073960)
 concurrent mark-sweep generation total 2965504K, used 2309885K [0x000745e0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x0007fae0, 0x0007fd329000, 0x0008)
}

This is when it starts, see (full X) count increases:

2012-10-25T03:32:44.664-0500: 76691.891: [GC [1 CMS-initial-mark: 2309885K(2965504K)] 2411929K(3989504K), 0.0047910 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
Total time for which application threads were stopped: 0.0059850 seconds

(other CMS type logs)

{Heap before GC invocations=9947 (full 183):
 par new generation total 1024000K, used 921082K [0x0006fae0, 0x000745e0, 0x000745e0)
  eden space 819200K, 100% used [0x0006fae0, 0x00072ce0, 0x00072ce0)
  from space 204800K,  49% used [0x00073960, 0x00073f97eaf8, 0x000745e0)
  to   space 204800K,   0% used [0x00072ce0, 0x00072ce0, 0x00073960)
 concurrent mark-sweep generation total 2965504K, used 2206292K [0x000745e0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x0007fae0, 0x0007fd329000, 0x0008)

A couple of log messages later concurrent mark-sweep used size is down:

{Heap before GC invocations=9948 (full 183):
 par new generation total 1024000K, used 938695K [0x0006fae0, 0x000745e0, 0x000745e0)
  eden space 819200K, 100% used [0x0006fae0, 0x00072ce0, 0x00072ce0)
  from space 204800K,  58% used [0x00072ce0, 0x0007342b1f00, 0x00073960)
  to   space 204800K,   0% used [0x00073960, 0x00073960, 0x000745e0)
 concurrent mark-sweep generation total 2965504K, used 1096146K [0x000745e0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x0007fae0, 
0x0007fd329000, 0x0008)
2012-10-25T03:32:50.479-0500: 76697.706: [GC Before GC:

There are a few things you could try:
* increase the JVM heap by say 1 GB and see how it goes
* increase bloom filter false positive, try 0.1 first (see http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance)
* increase index_interval sampling in yaml
* decreasing compaction_throughput and in_memory_compaction_limit can lessen the additional memory pressure compaction adds
* disable caches or ensure off heap caches are used

Watching the gc logs and the cassandra log is a great way to get a feel for what works in your situation. Also take note of any scheduled processing your app does which may impact things, and look for poorly performing queries. Finally, this book is a good reference on Java GC: http://amzn.com/0137142528 For my understanding, what was the average row size for the 400 million keys? Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 25/10/2012, at 1:22 PM, Bryan Talbot btal...@aeriagames.com wrote: On Wed, Oct 24, 2012 at 2:38 PM, Rob Coli rc...@palominodb.com wrote: On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot
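Concretely, "enabling all the GC logging" in cassandra-env.sh means flags along these lines (a sketch; the log path is an assumption, so point it somewhere writable by the cassandra user). These are the options that produce the Heap before/after GC dumps shown above:

```shell
# Sketch: GC logging flags for cassandra-env.sh (1.x era, HotSpot JVM).
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"   # assumed path
```
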
Re: constant CMS GC using CPU time
On Thu, Oct 25, 2012 at 4:15 AM, aaron morton aa...@thelastpickle.com wrote: This sounds very much like my heap is so consumed by (mostly) bloom filters that I am in steady state GC thrash. Yes, I think that was at least part of the issue. The rough numbers I've used to estimate working set are:
* bloom filter size for 400M rows at 0.00074 fp without java fudge (they are just a big array) 714 MB
* memtable size 1024 MB
* index sampling: 24 bytes + key (16 bytes for UUID) = 32 bytes; 400M / 128 default sampling = 3,125,000 entries; 3,125,000 * 32 = 95 MB
* java fudge X5 or X10 = 475 MB to 950 MB
* ignoring row cache and key cache
So the high side number is 2,213 to 2,688 MB. High because the fudge is a delicious sticky guess and the memtable space would rarely be full.

On a 5120 MB heap, with 800 MB new, you have roughly 4300 MB tenured (some goes to perm) and 75% of that is 3,225 MB. Not terrible, but it depends on the working set and how quickly stuff gets tenured, which depends on the workload.

These values seem reasonable and in line with what I was seeing. There are other CF and apps sharing this cluster but this one was the largest.

You can confirm these guesses somewhat manually by enabling all the GC logging in cassandra-env.sh. Restart the node and let it operate normally; probably best to keep repair off.

I was using jstat to monitor gc activity and some snippets from that are in my original email in this thread. The key behavior was that full gc was running pretty often and never able to reclaim much (if any) space.

There are a few things you could try:
* increase the JVM heap by say 1 GB and see how it goes
* increase bloom filter false positive, try 0.1 first (see http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance)
* increase index_interval sampling in yaml
* decreasing compaction_throughput and in_memory_compaction_limit can lessen the additional memory pressure compaction adds 
* disable caches or ensure off heap caches are used

I've done several of these already in addition to changing the app to reduce the number of rows retained. How does compaction_throughput relate to memory usage? I assumed that was more for IO tuning. I noticed that lowering concurrent_compactors to 4 (from default of 8) lowered the memory used during compactions. in_memory_compaction_limit_in_mb seems to only be used for wide rows and this CF didn't have any wider than in_memory_compaction_limit_in_mb. My multithreaded_compaction is still false.

Watching the gc logs and the cassandra log is a great way to get a feel for what works in your situation. Also take note of any scheduled processing your app does which may impact things, and look for poorly performing queries. Finally, this book is a good reference on Java GC: http://amzn.com/0137142528 For my understanding, what was the average row size for the 400 million keys? The compacted row mean size for the CF is 8815 bytes (as reported by cfstats) but that comes out to be much larger than the real load per node I was seeing. Each node had about 200GB of data for the CF with 4 nodes in the cluster and RF=3. At the time, the TTL for all columns was 3 days and gc_grace_seconds was 5 days. Since then I've reduced the TTL to 1 hour and set gc_grace_seconds to 0 so the number of rows and data dropped to a level it can handle. -Bryan
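For reference, the knobs discussed in this exchange all live in cassandra.yaml (1.1-era names). The values below are illustrative examples of "turning each one down", not recommendations:

```yaml
# Illustrative cassandra.yaml (1.1) excerpt -- example values, not recommendations.
compaction_throughput_mb_per_sec: 12   # default 16; lowering reduces compaction memory churn
concurrent_compactors: 4               # was 8 on this cluster; lowering it reduced compaction memory
in_memory_compaction_limit_in_mb: 32   # default 64; only rows under this are compacted fully in memory
index_interval: 256                    # default 128; raising it shrinks index sampling memory
```
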
Re: constant CMS GC using CPU time
On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot btal...@aeriagames.com wrote: The nodes with the most data used the most memory. All nodes are affected eventually, not just one. The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused, the constant GC continued. This sounds very much like my heap is so consumed by (mostly) bloom filters that I am in steady state GC thrash. Do you have heap graphs which show a healthy sawtooth GC cycle which then more or less flatlines? =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: constant CMS GC using CPU time
On Wed, Oct 24, 2012 at 2:38 PM, Rob Coli rc...@palominodb.com wrote: On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot btal...@aeriagames.com wrote: The nodes with the most data used the most memory. All nodes are affected eventually, not just one. The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused, the constant GC continued. This sounds very much like my heap is so consumed by (mostly) bloom filters that I am in steady state GC thrash. Yes, I think that was at least part of the issue. Do you have heap graphs which show a healthy sawtooth GC cycle which then more or less flatlines? I didn't save any graphs but that is what they would look like. I was using jstat to monitor gc activity. -Bryan
Re: constant CMS GC using CPU time
These GC settings are the default (recommended?) settings from cassandra-env. I added the UseCompressedOops. -Bryan

On Mon, Oct 22, 2012 at 6:15 PM, Will @ SOHO w...@voodoolunchbox.com wrote: On 10/22/2012 09:05 PM, aaron morton wrote:

# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

You are too far behind the reference JVMs. Parallel GC is the preferred and highest performing form in the current Security Baseline version of the JVMs. -- Bryan Talbot Architect / Platform team lead, Aeria Games and Entertainment Silicon Valley | Berlin | Tokyo | Sao Paulo
Re: constant CMS GC using CPU time
On Mon, Oct 22, 2012 at 6:05 PM, aaron morton aa...@thelastpickle.com wrote: The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused, the constant GC continued. If you restart a node is the onset of GC activity correlated to some event? Yes and no. When the nodes were generally under the .75 occupancy threshold, a weekly repair -pr job would cause them to go over the threshold and then stay there even after the repair had completed and there were no ongoing compactions. It acts as though at least some substantial amount of memory used during repair was never dereferenced once the repair was complete. Once one CF in particular grew larger, the constant GC would start up pretty soon (less than 90 minutes) after a node restart even without a repair. As a test we dropped the largest CF and the memory usage immediately dropped to acceptable levels and the constant GC stopped. So it's definitely related to data load. memtable size is 1 GB, row cache is disabled and key cache is small (default). How many keys did the CF have per node? I dismissed the memory used to hold bloom filters and index sampling. That memory is not considered part of the memtable size, and will end up in the tenured heap. It is generally only a problem with very large key counts per node. I've changed the app to retain less data for that CF but I think that it was about 400M rows per node. Row keys are a TimeUUID. All of the rows are write-once, never updated, and rarely read. There are no secondary indexes for this particular CF. They were 2+ GB (as reported by nodetool cfstats anyway). It looks like the default bloom_filter_fp_chance defaults to 0.0. The default should be 0.000744. 
If the chance is zero or null this code should run when a new SSTable is written: // paranoia -- we've had bugs in the thrift - avro - CfDef dance before, let's not let that break things logger.error("Bloom filter FP chance of zero isn't supposed to happen"); Were the CFs migrated from an old version? Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to 1.1.5 with an upgradesstables run at each upgrade along the way. I could not find a way to view the current bloom_filter_fp_chance settings when they are at a default value. JMX reports the actual fp rate and if a specific rate is set for a CF that shows up in describe table, but I couldn't find out how to tell what the default was. I didn't inspect the source. Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, number of rows, and fp chance are known? See o.a.c.utils.BloomFilter.getFilter() in the code. This http://hur.st/bloomfilter appears to give similar results. Ahh, very helpful. This indicates that 714MB would be used for the bloom filter for that one CF. JMX / cfstats reports Bloom Filter Space Used but the MBean method name (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If on-disk and in-memory space used is similar then summing up all the Bloom Filter Space Used says they're currently consuming 1-2 GB of the heap, which is substantial. If a CF is rarely read, is it safe to set bloom_filter_fp_chance to 1.0? It just means more trips to SSTable indexes for a read, correct? Trade RAM for time (disk I/O). -Bryan
Re: constant CMS GC using CPU time
Regarding memory usage after a repair ... Are the merkle trees kept around? On Oct 23, 2012 3:00 PM, Bryan Talbot btal...@aeriagames.com wrote: ...
Re: constant CMS GC using CPU time
If you are using the default settings I would try to correlate the GC activity with some application activity before tweaking. If this is happening on one machine out of 4, ensure that client load is distributed evenly. See if the rise in GC activity is related to compaction, repair or an increase in throughput. OpsCentre or some other monitoring can help with the last one. Your mention of TTL makes me think compaction may be doing a bit of work churning through rows. Some things I've done in the past before looking at heap settings:
* reduce compaction_throughput to reduce the memory churn
* reduce in_memory_compaction_limit
* if needed reduce concurrent_compactors

Currently it seems like the memory used scales with the amount of bytes stored and not with how busy the server actually is. That's not such a good thing. The memtable_total_space_in_mb in yaml tells C* how much memory to devote to the memtables. That with the global row cache setting says how much memory will be used with regard to storing data, and it will not increase in line with the static data load. Nowadays GC issues are typically due to more dynamic forces, like compaction, repair and throughput. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 20/10/2012, at 6:59 AM, Bryan Talbot btal...@aeriagames.com wrote: ok, let me try asking the question a different way ... How does cassandra use memory and how can I plan how much is needed? I have a 1 GB memtable and 5 GB total heap and that's still not enough even though the number of concurrent connections and garbage generation rate is fairly low. If I were using mysql or oracle, I could compute how much memory could be used by N concurrent connections, how much is allocated for caching, temp spaces, etc. How can I do this for cassandra? Currently it seems like the memory used scales with the amount of bytes stored and not with how busy the server actually is. That's not such a good thing. 
-Bryan On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot btal...@aeriagames.com wrote: In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11 (64-bit), the nodes are often getting stuck in a state where CMS collections of the old space are constantly running. The JVM configuration is using the standard settings in cassandra-env -- relevant settings are included below. The max heap is currently set to 5 GB with 800MB for new size. I don't believe that the cluster is overly busy and it seems to be performing well enough other than this issue. When nodes get into this state they never seem to leave it (by freeing up old space memory) without restarting cassandra. They typically enter this state while running nodetool repair -pr but once they start doing this, restarting them only fixes it for a couple of hours. Compactions are completing and are generally not queued up. All CF are using STCS. The busiest CF consumes about 100GB of space on disk, is write heavy, and all columns have a TTL of 3 days. Overall, there are 41 CF including those used for system keyspace and secondary indexes. The number of SSTables per node currently varies from 185-212. Other than frequent log warnings about GCInspector - Heap is 0.xxx full... and StorageService - Flushing CFS(...) to relieve memory pressure there are no other log entries to indicate there is a problem. Does the memory needed vary depending on the amount of data stored? If so, how can I predict how much jvm space is needed? I don't want to make the heap too large as that's bad too. Maybe there's a memory leak related to compaction that doesn't allow meta-data to be purged? -Bryan 12 GB of RAM in host with ~6 GB used by java and ~6 GB for OS and buffer cache. 
$ free -m
             total       used       free     shared    buffers     cached
Mem:         12001      11870        131          0          4       5778
-/+ buffers/cache:       6087       5914
Swap:            0          0          0

jvm settings in cassandra-env:

MAX_HEAP_SIZE="5G"
HEAP_NEWSIZE="800M"
# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

jstat shows about 12 full collections per minute with old heap usage constantly over 75% so CMS is always over the CMSInitiatingOccupancyFraction threshold.

$ jstat -gcutil -t 22917 5000 4
Timestamp   S0     S1     E      O      P      YGC     YGCT     FGC      FGCT       GCT
 132063.0  34.70   0.00  26.03  82.29  59.88  21580  506.887  17523  3078.941  3585.829
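The jstat sample can be read mechanically (a sketch; note the per-minute full GC figure below is an average since JVM start, which is why it comes out lower than the ~12/min observed over short intervals):

```python
# Interpret the jstat -gcutil sample above (sketch).
timestamp = 132063.0          # seconds since JVM start
old_gen_pct = 82.29           # "O" column: old (CMS) gen occupancy, percent
cms_trigger_pct = 75          # -XX:CMSInitiatingOccupancyFraction=75
fgc, fgct = 17523, 3078.941   # full GC count and total full GC seconds

# With UseCMSInitiatingOccupancyOnly, CMS starts whenever old-gen occupancy
# exceeds the fraction; at 82.29% > 75% it re-triggers immediately after
# every cycle -- the "constant GC" described in this thread.
always_collecting = old_gen_pct > cms_trigger_pct

full_gc_share = fgct / timestamp        # fraction of wall time in full GC (~2.3%)
full_gc_per_min = fgc / timestamp * 60  # average rate since JVM start (~8/min)

print(always_collecting, round(full_gc_share, 3), round(full_gc_per_min, 1))
```
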
Re: constant CMS GC using CPU time
The memory usage was correlated with the size of the data set. The nodes were a bit unbalanced which is normal due to variations in compactions. The nodes with the most data used the most memory. All nodes are affected eventually, not just one. The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused, the constant GC continued. As a test we dropped the largest CF and the memory usage immediately dropped to acceptable levels and the constant GC stopped. So it's definitely related to data load. memtable size is 1 GB, row cache is disabled and key cache is small (default). I believe one culprit turned out to be the bloom filters. They were 2+ GB (as reported by nodetool cfstats anyway). It looks like the default bloom_filter_fp_chance defaults to 0.0 even though guides recommend 0.10 as the minimum value. Raising that to 0.20 for some write-mostly CF reduced memory used by 1GB or so. Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, number of rows, and fp chance are known? -Bryan On Mon, Oct 22, 2012 at 12:25 AM, aaron morton aa...@thelastpickle.com wrote: If you are using the default settings I would try to correlate the GC activity with some application activity before tweaking. If this is happening on one machine out of 4, ensure that client load is distributed evenly. See if the rise in GC activity is related to compaction, repair or an increase in throughput. OpsCentre or some other monitoring can help with the last one. Your mention of TTL makes me think compaction may be doing a bit of work churning through rows. 
Some things I've done in the past before looking at heap settings: * reduce compaction_throughput to reduce the memory churn * reduce in_memory_compaction_limit * if needed reduce concurrent_compactors Currently it seems like the memory used scales with the amount of bytes stored and not with how busy the server actually is. That's not such a good thing. The memtable_total_space_in_mb in yaml tells C* how much memory to devote to the memtables. That with the global row cache setting says how much memory will be used with regard to storing data and it will not increase inline with the static data load. Now days GC issues are typically due to more dynamic forces, like compaction, repair and throughput. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 20/10/2012, at 6:59 AM, Bryan Talbot btal...@aeriagames.com wrote: ok, let me try asking the question a different way ... How does cassandra use memory and how can I plan how much is needed? I have a 1 GB memtable and 5 GB total heap and that's still not enough even though the number of concurrent connections and garbage generation rate is fairly low. If I were using mysql or oracle, I could compute how much memory could be used by N concurrent connections, how much is allocated for caching, temp spaces, etc. How can I do this for cassandra? Currently it seems like the memory used scales with the amount of bytes stored and not with how busy the server actually is. That's not such a good thing. -Bryan On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot btal...@aeriagames.comwrote: In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11 (64-bit), the nodes are often getting stuck in state where CMS collections of the old space are constantly running. The JVM configuration is using the standard settings in cassandra-env -- relevant settings are included below. The max heap is currently set to 5 GB with 800MB for new size. 
I don't believe that the cluster is overly busy and it seems to be performing well enough other than this issue. When nodes get into this state they never seem to leave it (by freeing up old space memory) without restarting cassandra. They typically enter this state while running nodetool repair -pr, but once they start doing this, restarting them only fixes it for a couple of hours.

Compactions are completing and are generally not queued up. All CF are using STCS. The busiest CF consumes about 100 GB of space on disk, is write heavy, and all columns have a TTL of 3 days. Overall, there are 41 CF including those used for the system keyspace and secondary indexes. The number of SSTables per node currently varies from 185-212.

Other than frequent log warnings about "GCInspector - Heap is 0.xxx full..." and "StorageService - Flushing CFS(...) to relieve memory pressure" there are no other log entries to indicate there is a problem.

Does the memory needed vary depending on the amount of data stored? If so, how can I predict how much jvm space is needed? I don't want to make the heap too large, as that's bad too. Maybe there's a memory leak related to compaction that doesn't allow meta-data to be purged?

-Bryan

12 GB of RAM
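The compaction and memtable knobs Aaron mentions all live in cassandra.yaml. A sketch of where they sit, with purely illustrative values (the numbers below are assumptions for this discussion, not recommendations; the 1.1-era defaults are noted in comments):

```yaml
# cassandra.yaml -- illustrative values only, not recommendations
compaction_throughput_mb_per_sec: 8    # default 16; lowering reduces compaction memory churn
in_memory_compaction_limit_in_mb: 32   # default 64; wider rows use the slower two-pass path
concurrent_compactors: 4               # default is one per CPU core
memtable_total_space_in_mb: 1024       # cap on memory for all memtables combined
```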
Re: constant CMS GC using CPU time
The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused, the constant GC continued.

If you restart a node, is the onset of GC activity correlated to some event?

As a test we dropped the largest CF and the memory usage immediately dropped to acceptable levels and the constant GC stopped. So it's definitely related to data load. memtable size is 1 GB, row cache is disabled and key cache is small (default).

How many keys did the CF have per node? I dismissed the memory used to hold bloom filters and index sampling. That memory is not considered part of the memtable size, and will end up in the tenured heap. It is generally only a problem with very large key counts per node.

They were 2+ GB (as reported by nodetool cfstats anyway). It looks like the default bloom_filter_fp_chance defaults to 0.0

The default should be 0.000744. If the chance is zero or null, this code should run when a new SSTable is written:

// paranoia -- we've had bugs in the thrift - avro - CfDef dance before, let's not let that break things
logger.error("Bloom filter FP chance of zero isn't supposed to happen");

Were the CFs migrated from an old version?

Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, the number of rows, and the fp chance are known?

See o.a.c.utils.BloomFilter.getFilter() in the code. This http://hur.st/bloomfilter appears to give similar results.

Cheers

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/10/2012, at 4:38 AM, Bryan Talbot btal...@aeriagames.com wrote:

The memory usage was correlated with the size of the data set. The nodes were a bit unbalanced, which is normal due to variations in compactions. The nodes with the most data used the most memory. All nodes are affected eventually, not just one.
The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused, the constant GC continued. As a test we dropped the largest CF and the memory usage immediately dropped to acceptable levels and the constant GC stopped. So it's definitely related to data load. memtable size is 1 GB, row cache is disabled and key cache is small (default).

I believe one culprit turned out to be the bloom filters. They were 2+ GB (as reported by nodetool cfstats anyway). It looks like bloom_filter_fp_chance defaults to 0.0 even though guides recommend 0.10 as the minimum value. Raising that to 0.20 for some write-mostly CF reduced memory used by 1 GB or so.

Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, the number of rows, and the fp chance are known?

-Bryan

On Mon, Oct 22, 2012 at 12:25 AM, aaron morton aa...@thelastpickle.com wrote:

If you are using the default settings I would try to correlate the GC activity with some application activity before tweaking. If this is happening on one machine out of 4, ensure that client load is distributed evenly. See if the rise in GC activity is related to compaction, repair or an increase in throughput. OpsCentre or some other monitoring can help with the last one. Your mention of TTL makes me think compaction may be doing a bit of work churning through rows.

Some things I've done in the past before looking at heap settings:

* reduce compaction_throughput to reduce the memory churn
* reduce in_memory_compaction_limit
* if needed, reduce concurrent_compactors

Currently it seems like the memory used scales with the amount of bytes stored and not with how busy the server actually is. That's not such a good thing.

The memtable_total_space_in_mb setting in the yaml tells C* how much memory to devote to the memtables.
That, together with the global row cache setting, determines how much memory will be used for storing data, and it will not increase in line with the static data load. These days GC issues are typically due to more dynamic forces, like compaction, repair and throughput.

Hope that helps.

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 20/10/2012, at 6:59 AM, Bryan Talbot btal...@aeriagames.com wrote:

ok, let me try asking the question a different way ... How does cassandra use memory and how can I plan how much is needed? I have a 1 GB memtable and 5 GB total heap and that's still not enough, even though the number of concurrent connections and the garbage generation rate are fairly low. If I were using mysql or oracle, I could compute how much memory could be used by N concurrent connections, how much is allocated for caching, temp spaces, etc. How can I do this for cassandra? Currently it seems like the memory used scales with the amount of bytes stored and not
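Aaron's pointer to o.a.c.utils.BloomFilter.getFilter() and http://hur.st/bloomfilter can be approximated with the standard bloom filter sizing formula. A minimal sketch (this assumes an optimally sized filter; the real implementation rounds to whole buckets per key, so treat it as an estimate rather than what cfstats will report exactly):

```python
import math

def bloom_filter_mb(n_keys, fp_chance):
    """Optimal bloom filter size in MB: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -n_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8 / (1024 * 1024)

# ~400M keys at the 1.1 default fp chance of 0.000744 gives roughly 715 MB,
# in line with the ~714 MB estimate earlier in the thread; raising the fp
# chance to 0.1 shrinks the same filter to under a third of that.
default_mb = bloom_filter_mb(400_000_000, 0.000744)
relaxed_mb = bloom_filter_mb(400_000_000, 0.1)
```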
Re: constant CMS GC using CPU time
On 10/22/2012 09:05 PM, aaron morton wrote:

# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

You are too far behind the reference JVMs. Parallel GC is the preferred and highest-performing collector in the current Security Baseline version of the JVMs.
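Before swapping collectors, it is worth verifying which collector is actually doing the work by enabling GC logging in cassandra-env.sh, as suggested earlier in the thread. A sketch using standard HotSpot 6 options (the log path here is an assumption; adjust to your install):

```shell
# GC logging -- standard HotSpot flags; the log path below is an assumed example
JVM_OPTS="$JVM_OPTS -verbose:gc"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
```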
Re: constant CMS GC using CPU time
On 18.10.2012 20:06, Bryan Talbot wrote:

In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11 (64-bit), the nodes are often getting stuck in a state where CMS collections of the old space are constantly running.

You need more Java heap memory.
Re: constant CMS GC using CPU time
ok, let me try asking the question a different way ... How does cassandra use memory and how can I plan how much is needed? I have a 1 GB memtable and 5 GB total heap and that's still not enough, even though the number of concurrent connections and the garbage generation rate are fairly low.

If I were using mysql or oracle, I could compute how much memory could be used by N concurrent connections, how much is allocated for caching, temp spaces, etc. How can I do this for cassandra? Currently it seems like the memory used scales with the amount of bytes stored and not with how busy the server actually is. That's not such a good thing.

-Bryan

On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot btal...@aeriagames.com wrote:

In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11 (64-bit), the nodes are often getting stuck in a state where CMS collections of the old space are constantly running. The JVM configuration is using the standard settings in cassandra-env -- relevant settings are included below. The max heap is currently set to 5 GB with 800 MB for new size.

I don't believe that the cluster is overly busy and it seems to be performing well enough other than this issue. When nodes get into this state they never seem to leave it (by freeing up old space memory) without restarting cassandra. They typically enter this state while running nodetool repair -pr, but once they start doing this, restarting them only fixes it for a couple of hours.

Compactions are completing and are generally not queued up. All CF are using STCS. The busiest CF consumes about 100 GB of space on disk, is write heavy, and all columns have a TTL of 3 days. Overall, there are 41 CF including those used for the system keyspace and secondary indexes. The number of SSTables per node currently varies from 185-212. Other than frequent log warnings about GCInspector - Heap is 0.xxx full... and StorageService - Flushing CFS(...)
to relieve memory pressure there are no other log entries to indicate there is a problem.

Does the memory needed vary depending on the amount of data stored? If so, how can I predict how much jvm space is needed? I don't want to make the heap too large, as that's bad too. Maybe there's a memory leak related to compaction that doesn't allow meta-data to be purged?

-Bryan

12 GB of RAM in the host, with ~6 GB used by java and ~6 GB for the OS and buffer cache.

$ free -m
             total    used    free  shared  buffers  cached
Mem:         12001   11870     131       0        4    5778
-/+ buffers/cache:    6087    5914
Swap:            0       0       0

jvm settings in cassandra-env:

MAX_HEAP_SIZE="5G"
HEAP_NEWSIZE="800M"

# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

jstat shows about 12 full collections per minute with old heap usage constantly over 75%, so CMS is always over the CMSInitiatingOccupancyFraction threshold.

$ jstat -gcutil -t 22917 5000 4
Timestamp   S0     S1     E      O      P      YGC    YGCT     FGC    FGCT      GCT
132063.0    34.70  0.00   26.03  82.29  59.88  21580  506.887  17523  3078.941  3585.829
132068.0    34.70  0.00   50.02  81.23  59.88  21580  506.887  17524  3079.220  3586.107
132073.1    0.00   24.92  46.87  81.41  59.88  21581  506.932  17525  3079.583  3586.515
132078.1    0.00   24.92  64.71  81.40  59.88  21581  506.932  17527  3079.853  3586.785

Other hosts not currently experiencing the high CPU load have a heap less than .75 full.
$ jstat -gcutil -t 6063 5000 4
Timestamp   S0    S1     E      O      P      YGC    YGCT      FGC    FGCT      GCT
520731.6    0.00  12.70  36.37  71.33  59.26  46453  1688.809  14785  2130.779  3819.588
520736.5    0.00  12.70  53.25  71.33  59.26  46453  1688.809  14785  2130.779  3819.588
520741.5    0.00  12.70  68.92  71.33  59.26  46453  1688.809  14785  2130.779  3819.588
520746.5    0.00  12.70  83.11  71.33  59.26  46453  1688.809  14785  2130.779  3819.588
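The occupancy numbers in the jstat output above can be sanity-checked against the heap arithmetic from earlier in the thread. A minimal sketch of the CMS trigger math for these settings (ignoring perm gen, which sits outside -Xmx on this JVM):

```python
# -Xmx5G, HEAP_NEWSIZE=800M, -XX:CMSInitiatingOccupancyFraction=75
heap_mb = 5 * 1024                   # 5120 MB max heap
new_mb = 800                         # young generation
tenured_mb = heap_mb - new_mb        # 4320 MB old generation
cms_trigger_mb = tenured_mb * 0.75   # CMS starts a cycle near 3240 MB occupied

# jstat's O column is old-gen occupancy as a percentage; a value pinned
# above 75 (e.g. the 82.29 seen on the loaded node) means back-to-back
# CMS cycles, which matches the ~12 full collections per minute observed.
always_collecting = 82.29 > 75
```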