[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242183#comment-14242183 ]
Philo Yang commented on CASSANDRA-8447:
---------------------------------------

I have the same problem: a full GC does not reduce the size of the old generation. A few days ago I posted this to the mailing list, and people suggested it could be solved by tuning the GC settings, but that did not work for me. After seeing this issue I suspect it may be a bug. I can offer some information from my cluster in the hope that it helps you find the bug, if it is one. Of course, the cause of my problem may be different from [~jlacefie]'s.

The unresponsiveness only appears on some nodes: those nodes become unresponsive several times a day, while the other nodes never do. When there is no trouble, the load on all nodes is the same, so I do not think the unresponsiveness is caused by heavy load.

Before a node becomes unresponsive, it is easy to see with jstat that the old gen stays very large after a CMS collection (normally it drops below 1 GB after a full GC, but when the trouble starts it is still more than 4 GB after CMS GC). And there is often a compaction stuck at 99% or even 99.99% for many minutes, like this:

pending tasks: 1
   compaction type   keyspace    table     completed       total   unit   progress
        Compaction   keyspace    table     354680703   354710642   bytes     99.99%

However, I am not sure the trouble always comes with a stuck compaction, because I have not followed every unresponsive episode.

I used jmap to print the objects held on the heap; I do not know whether it is helpful to you:

 num     #instances         #bytes  class name
----------------------------------------------
   1:      11899268     3402792016  [B
   2:      23734819     1139271312  java.nio.HeapByteBuffer
   3:      11140273      306165600  [Ljava.nio.ByteBuffer;
   4:       9484838      227636112  org.apache.cassandra.db.composites.CompoundComposite
   5:       8220604      197294496  org.apache.cassandra.db.composites.BoundedComposite
   6:         27187       69131928  [J
   7:       1673344       53547008  org.apache.cassandra.db.composites.CompoundSparseCellName
   8:       1540101       49283232  org.apache.cassandra.db.BufferCell
   9:          3158       45471360  [Lorg.apache.cassandra.db.composites.Composite;
  10:          2527       27865040  [I
  11:        251797       20236456  [Ljava.lang.Object;
  12:        417752       12899976  [C
  13:        263201       10528040  org.apache.cassandra.db.BufferExpiringCell
  14:        322324       10314368  com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
  15:        322237       10311584  com.googlecode.concurrentlinkedhashmap.ConcurrentHashMapV8$Node
  16:        417331       10015944  java.lang.String
  17:         86368        8891280  [Lorg.apache.cassandra.db.Cell;
  18:        349917        8398008  org.apache.cassandra.cql3.ColumnIdentifier
  19:        204161        8166440  java.util.TreeMap$Entry
  20:        322324        7735776  com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
  21:        322324        7735776  org.apache.cassandra.cache.KeyCacheKey
  22:        317274        7614576  java.lang.Double
  23:        317154        7611696  org.apache.cassandra.db.RowIndexEntry
  24:        314642        7551408  java.util.concurrent.ConcurrentSkipListMap$Node
  25:         52560        7316584  <constMethodKlass>
  26:        292136        7011264  java.lang.Long
  27:         52560        6740064  <methodKlass>
  28:          5290        5937512  <constantPoolKlass>
  29:        160281        3846744  org.apache.cassandra.db.BufferDecoratedKey
  30:        155777        3738648  java.util.concurrent.ConcurrentSkipListMap$Index
  31:          5290        3642232  <instanceKlassKlass>
  32:        150284        3606816  org.apache.cassandra.db.AtomicBTreeColumns
  33:        150261        3606264  org.apache.cassandra.db.AtomicBTreeColumns$Holder
  34:         87861        3514440  org.apache.cassandra.db.ArrayBackedSortedColumns
  35:         87768        3510720  org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask
  36:          4634        3372096  <constantPoolCacheKlass>
  37:         84262        3370480  java.util.Collections$SingletonMap
  38:          6243        2778728  <methodDataKlass>
  39:        173490        2775840  org.apache.cassandra.dht.LongToken
  40:         82000        2624000  java.util.RegularEnumSet
  41:         81981        2623392  org.apache.cassandra.net.MessageIn
  42:         81980        2623360  org.apache.cassandra.net.MessageDeliveryTask
  43:        102511        2460264  java.util.concurrent.ConcurrentLinkedQueue$Node
  44:         94901        2277624  org.apache.cassandra.db.DeletionInfo
  45:         93837        2252088  java.util.concurrent.Executors$RunnableAdapter
  46:        140525        2248400  org.apache.cassandra.db.composites.SimpleDenseCellName
  47:         87753        2106072  org.apache.cassandra.db.Mutation
  48:             1        2097168  [Lcom.googlecode.concurrentlinkedhashmap.ConcurrentHashMapV8$Node;
  49:         82130        1971120  java.net.InetAddress$InetAddressHolder
  50:         82127        1971048  java.net.Inet4Address

I also took a jstack of the JVM; there is no deadlock and I cannot find any problem in it. I posted it to https://gist.github.com/yangzhe1991/755ea2a10520be1fe59a

Hope this helps you solve the issue :)
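For reference, the jstat/jmap/jstack checks described above can be reproduced with the standard JDK tools. A minimal sketch, assuming the Cassandra JVM can be found by its CassandraDaemon process name and sampling once per second:

PID=$(pgrep -f CassandraDaemon)

# Old gen occupancy: with -gcutil, the O column is old gen utilization as a
# percentage of capacity, so a CMS cycle that fails to shrink it is visible directly.
jstat -gcutil $PID 1000

# Class histogram of objects on the heap, as quoted above (top ~50 entries).
jmap -histo $PID | head -60

# Thread dump to rule out deadlocks or stuck compaction threads.
jstack -l $PID > jstack.$(date +%s).txt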
> Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
> --------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8447
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cluster size - 4 nodes
>                      Node size - 12 CPU (hyper-threaded to 24 cores), 192 GB RAM, 2 RAID 0 arrays (Data - 10 disk, spinning 10k drives | CL - 2 disk, spinning 10k drives)
>                      OS - RHEL 6.5
>                      JVM - Oracle 1.7.0_71
>                      Cassandra version 2.0.11
>            Reporter: jonathan lacefield
>         Attachments: Node_with_compaction.png, Node_without_compaction.png, cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, results.tar.gz, visualvm_screenshot
>
> Behavior - If autocompaction is enabled, nodes will become unresponsive due to a full Old Gen heap which is not cleared during CMS GC.
>
> Test methodology - Disabled autocompaction on 3 nodes and left autocompaction enabled on 1 node. Executed different Cassandra stress loads, using write-only operations. Monitored visualvm and jconsole for heap pressure. Captured iostat and dstat for most tests. Captured a heap dump from the 50-thread load. Hints were disabled for testing on all nodes to alleviate GC noise due to hints backing up. (A sketch of this per-node setup follows the results list below.)
>
> Data load test through Cassandra stress - /usr/bin/cassandra-stress write n=1900000000 -rate threads=<different threads tested> -schema replication\(factor=3\) keyspace="Keyspace1" -node <all nodes listed>
>
> Data load thread count and results:
> * 1 thread - Still running, but it looks like the node can sustain this load (approx 500 writes per second per node)
> * 5 threads - Nodes become unresponsive due to full Old Gen heap; CMS measured in the 60-second range (approx 2k writes per second per node)
> * 10 threads - Nodes become unresponsive due to full Old Gen heap; CMS measured in the 60-second range
> * 50 threads - Nodes become unresponsive due to full Old Gen heap; CMS measured in the 60-second range (approx 10k writes per second per node)
> * 100 threads - Nodes become unresponsive due to full Old Gen heap; CMS measured in the 60-second range (approx 20k writes per second per node)
> * 200 threads - Nodes become unresponsive due to full Old Gen heap; CMS measured in the 60-second range (approx 25k writes per second per node)
> Note - the observed behavior was the same for all tests except the single-threaded test, which does not appear to show this behavior.
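For reference, a rough sketch of how the per-node setup and one of the load runs above might look. The host names and the 50-thread count are placeholders, and it assumes nodetool disableautocompaction is available in this release (otherwise the per-table compaction thresholds can be set to 0 with nodetool setcompactionthreshold instead):

# Disable autocompaction on 3 of the 4 nodes; leave it enabled on the fourth.
for host in node1 node2 node3; do
    nodetool -h "$host" disableautocompaction
done

# One of the write-only stress runs from the description, here with 50 threads.
/usr/bin/cassandra-stress write n=1900000000 -rate threads=50 \
    -schema replication\(factor=3\) keyspace="Keyspace1" \
    -node node1,node2,node3,node4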
> Tested different GC and Linux OS settings with a focus on the 50- and 200-thread loads.
>
> JVM settings tested:
> # Default, out-of-the-box cassandra-env.sh settings
> # 10 G Max | 1 G New - default cassandra-env.sh settings
> # 10 G Max | 1 G New - default cassandra-env.sh settings, plus:
> #* JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50"
> # 20 G Max | 10 G New
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
> # 20 G Max | 1 G New
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
>
> Linux OS settings tested:
> # Disabled Transparent Huge Pages:
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/defrag
> # Enabled Huge Pages:
> echo 21500000000 > /proc/sys/kernel/shmmax   (over 20 GB, for the heap)
> echo 1536 > /proc/sys/vm/nr_hugepages   (20 GB / 2 MB page size)
> # Disabled NUMA:
> numa=off in /etc/grub.conf
> # Verified all settings documented here were implemented:
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
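For reference, a quick way to confirm that these OS-level settings actually took effect on each node; the paths are the standard RHEL 6 locations and the expected values are the ones set above:

# Transparent Huge Pages: both files should show [never] selected.
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# Huge pages and shared memory limit as configured above.
cat /proc/sys/vm/nr_hugepages
cat /proc/sys/kernel/shmmax

# NUMA: numa=off should appear on the kernel command line if grub.conf was updated.
grep -o numa=off /proc/cmdline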
> Attachments:
> # .yaml
> # fio output - results.tar.gz
> # 50 thread heap dump - https://drive.google.com/a/datastax.com/file/d/0B4Imdpu2YrEbMGpCZW5ta2liQ2c/view?usp=sharing
> # 100 thread - VisualVM anonymous screenshot - visualvm_screenshot
> # dstat screenshot with compaction - Node_with_compaction.png
> # dstat screenshot without compaction - Node_without_compaction.png
> # gcinspector messages from system.log
> # gc.log output - gc.logs.tar.gz
>
> Observations:
> # Even though this is a spinning-disk implementation, disk I/O looks good.
> #* Output from the jshook perf monitor (https://github.com/jshook/perfscripts) is attached.
> #* Note: we leveraged direct I/O for all tests by adding direct=1 to the .global config files.
> # CPU usage is moderate until large GC events occur.
> # Once the old gen heap fills up and cannot be cleaned, memtable post flushers start to back up (they show a lot of pending tasks in tpstats).
> # The node itself (i.e. ssh) is still responsive, but the Cassandra instance becomes unresponsive.
> # Once the old gen heap fills up, cassandra-stress starts to throw CL ONE errors stating there aren't enough replicas to satisfy....
> # The heap dump from the 50-thread run under JVM scenario 1 is attached.
> #* It appears to show a compaction thread consuming a lot of memory.
> # Sample system.log output for the GC issues is attached.
> # strace -e futex -p $PID -f -c output during the 100-thread load, while old gen was "filling", just before full:
>
> % time     seconds  usecs/call     calls    errors syscall
> 100.00  244.886766        4992     49052      7507 futex
> 100.00  244.886766                 49052      7507 total
>
> # htop during a full GC cycle - https://s3.amazonaws.com/uploads.hipchat.com/6528/480117/4ZlgcoNScb6kRM2/upload.png
> # Nothing is blocked via tpstats on these nodes.
> # Compaction does have pending tasks, upwards of 20, on the nodes.
> # Nodes without compaction achieved approximately 20k writes per second per node without errors or drops.
>
> Next Steps:
> # Will try to create a flame graph and update the load here - http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> # Will try to recreate in another environment

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)