Aleksey Yeschenko reopened CASSANDRA-8447:
------------------------------------------

> Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8447
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cluster size - 4 nodes
>                      Node size - 12 CPU (hyper-threaded to 24 cores), 192 GB RAM,
>                      2 RAID 0 arrays (Data - 10 disks, spinning 10k drives | CL - 2 disks, spinning 10k drives)
>                      OS - RHEL 6.5
>                      JVM - Oracle 1.7.0_71
>                      Cassandra version 2.0.11
>            Reporter: jonathan lacefield
>             Fix For: 2.0.12
>
>         Attachments: Node_with_compaction.png, Node_without_compaction.png, cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, output.1.svg, output.2.svg, output.svg, results.tar.gz, visualvm_screenshot
>
>
> Behavior - With autocompaction enabled, nodes become unresponsive because the Old Gen heap fills up and is not reclaimed by CMS GC.
> Test methodology - Disabled autocompaction on 3 nodes and left it enabled on 1 node. Executed several Cassandra stress loads using write-only operations. Monitored heap pressure with VisualVM and JConsole (a scriptable jstat equivalent is sketched after the results list below). Captured iostat and dstat for most tests, and a heap dump from the 50-thread load. Hints were disabled on all nodes to keep backed-up hints from adding GC noise.
> Data load test through Cassandra stress:
> /usr/bin/cassandra-stress write n=1900000000 -rate threads=<different threads tested> -schema replication\(factor=3\) keyspace="Keyspace1" -node <all nodes listed>
> Data load thread count and results:
> * 1 thread - still running; the node appears able to sustain this load (approx 500 writes per second per node)
> * 5 threads - nodes become unresponsive due to a full Old Gen heap; CMS measured in the 60-second range (approx 2k writes per second per node)
> * 10 threads - nodes become unresponsive due to a full Old Gen heap; CMS measured in the 60-second range
> * 50 threads - nodes become unresponsive due to a full Old Gen heap; CMS measured in the 60-second range (approx 10k writes per second per node)
> * 100 threads - nodes become unresponsive due to a full Old Gen heap; CMS measured in the 60-second range (approx 20k writes per second per node)
> * 200 threads - nodes become unresponsive due to a full Old Gen heap; CMS measured in the 60-second range (approx 25k writes per second per node)
> Note - every test showed the same behavior except the single-threaded test, which does not appear to reproduce it.
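> As a reproduction aid (not from the ticket - the PID lookup, output path, and node list are placeholders), the VisualVM/JConsole observation above can be scripted by sampling Old Gen occupancy with jstat while one of the stress runs is in flight; the "O" column of jstat -gcutil is Old Gen percent used, which per the report climbs toward 100 and stays there across CMS cycles:
>
> CASSANDRA_PID=$(pgrep -f CassandraDaemon)        # assumes one Cassandra instance per host
> jstat -gcutil $CASSANDRA_PID 5000 > jstat.out &  # sample GC utilization every 5 s
> /usr/bin/cassandra-stress write n=1900000000 -rate threads=50 -schema replication\(factor=3\) keyspace="Keyspace1" -node node1,node2,node3,node4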
> Tested different GC and Linux OS settings, focusing on the 50- and 200-thread loads.
> JVM settings tested:
> # default, out-of-the-box cassandra-env.sh settings
> # 10 G Max | 1 G New - default cassandra-env.sh settings
> # 10 G Max | 1 G New - default cassandra-env.sh settings, plus:
> #* JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50"
> # 20 G Max | 10 G New, plus:
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
> # 20 G Max | 1 G New, plus the same additional JVM_OPTS list as the 20 G Max | 10 G New scenario
> Linux OS settings tested:
> # Disabled Transparent Huge Pages
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/defrag
> # Enabled Huge Pages
> echo 21500000000 > /proc/sys/kernel/shmmax (over 20 GB, to cover the heap)
> echo 1536 > /proc/sys/vm/nr_hugepages (20 GB / 2 MB page size)
> # Disabled NUMA
> numa=off in /etc/grub.conf
> # Verified all settings documented here were implemented:
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
> Attachments:
> # cassandra.yaml
> # fio output - results.tar.gz
> # 50 thread heap dump - https://drive.google.com/a/datastax.com/file/d/0B4Imdpu2YrEbMGpCZW5ta2liQ2c/view?usp=sharing
> # 100 thread - VisualVM anonymous screenshot - visualvm_screenshot
> # dstat screenshot with compaction - Node_with_compaction.png
> # dstat screenshot without compaction - Node_without_compaction.png
> # gcinspector messages from system.log - gcinspector_messages.txt
> # gc.log output - gc.logs.tar.gz
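> The gc.logs.tar.gz attachment implies GC logging was already enabled; for anyone reproducing, a minimal sketch of standard HotSpot 7 logging flags (appended in cassandra-env.sh the same way as the scenarios above; the log path is a placeholder) that make the failure mode legible:
>
> JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
> JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
> JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
> JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
> JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
>
> With these, "promotion failed" and "concurrent mode failure" entries in gc.log help distinguish Old Gen fragmentation from plain exhaustion during the stuck CMS cycles.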
> Observations:
> # even though this is a spinning-disk implementation, disk I/O looks good
> #* output from jshook's perf monitor https://github.com/jshook/perfscripts is attached
> #* note, we leveraged direct I/O for all tests by adding direct=1 to the .global config files
> # CPU usage is moderate until large GC events occur
> # once the Old Gen heap fills up and cannot be cleaned, memtable post flushers start to back up (a lot pending in tpstats)
> # the node itself (i.e. ssh) is still responsive, but the Cassandra instance becomes unresponsive
> # once the Old Gen heap fills up, Cassandra stress starts throwing CL ONE errors stating there aren't enough replicas to satisfy the consistency level
> # heap dump from the 50-thread load, JVM scenario 1, is attached
> #* appears to show a compaction thread consuming a lot of memory
> # sample system.log output for the GC issues is attached
> # strace -e futex -p $PID -f -c output during the 100-thread load, while Old Gen was "filling" (just before full):
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ---------------
> 100.00  244.886766        4992     49052      7507 futex
> ------ ----------- ----------- --------- --------- ---------------
> 100.00  244.886766                 49052      7507 total
> # htop during full GC cycle - https://s3.amazonaws.com/uploads.hipchat.com/6528/480117/4ZlgcoNScb6kRM2/upload.png
> # nothing is blocked via tpstats on these nodes
> # compaction does have pending tasks, upwards of 20, on the nodes
> # nodes without compaction achieved approximately 20k writes per second per node without errors or drops
> Next Steps:
> # Will try to create a flame graph, per http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html, and upload it here (a command sketch follows below)
> # Will try to recreate in another environment
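> A sketch of the flame-graph step (commands assume Brendan Gregg's FlameGraph repo and the perf-map-agent project, which resolves JIT-compiled Java symbols for perf; the paths, 99 Hz rate, and 30 s window are illustrative, not from the ticket):
>
> git clone https://github.com/jrudolph/perf-map-agent && (cd perf-map-agent && cmake . && make)
> git clone https://github.com/brendangregg/FlameGraph
> perf record -F 99 -a -g -- sleep 30                                      # system-wide sample during the stall
> perf-map-agent/bin/create-java-perf-map.sh $(pgrep -f CassandraDaemon)   # emit /tmp/perf-<pid>.map for JIT symbols
> perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > output.svg
>
> Note the caveat from the linked post: on a JDK of this era (7u71), Java stacks may be incomplete because HotSpot omits frame pointers, so the graph is most useful for GC, JVM-internal, and kernel time.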