[
https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksey Yeschenko resolved CASSANDRA-8447.
------------------------------------------
Resolution: Duplicate
> Nodes stuck in CMS GC cycle with very little traffic when compaction is
> enabled
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-8447
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: Cluster size - 4 nodes
> Node size - 12 CPU (hyper threaded to 24 cores), 192 GB RAM, 2 Raid 0 arrays
> (Data - 10 disk, spinning 10k drives | CL 2 disk, spinning 10k drives)
> OS - RHEL 6.5
> jvm - oracle 1.7.0_71
> Cassandra version 2.0.11
> Reporter: jonathan lacefield
> Fix For: 2.0.12
>
> Attachments: Node_with_compaction.png, Node_without_compaction.png,
> cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug,
> output.1.svg, output.2.svg, output.svg, results.tar.gz, visualvm_screenshot
>
>
> Behavior - If autocompaction is enabled, nodes become unresponsive due to a
> full Old Gen heap that CMS GC cannot clear.
> Test methodology - disabled autocompaction on 3 nodes, left autocompaction
> enabled on 1 node. Executed different Cassandra stress loads, using write-only
> operations. Monitored heap pressure via VisualVM and JConsole. Captured iostat
> and dstat for most tests, and a heap dump from the 50 thread load. Hints were
> disabled on all nodes to eliminate GC noise from hints backing up.
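> For context, a rough sketch of the toggles described above (hedged: nodetool
> disableautocompaction exists in newer Cassandra releases; on 2.0 the same
> effect may require going through JMX, so treat the command as illustrative):
> # Disable autocompaction per node (illustrative)
> nodetool disableautocompaction
> # Disable hints (cassandra.yaml)
> hinted_handoff_enabled: false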
> Data load test through Cassandra stress - /usr/bin/cassandra-stress write
> n=1900000000 -rate threads=<different threads tested> -schema
> replication\(factor=3\) keyspace="Keyspace1" -node <all nodes listed>
> Data load thread count and results:
> * 1 thread - still running at time of writing, but the node looks able to
> sustain this load (approx 500 writes per second per node)
> * 5 threads - Nodes become unresponsive due to full Old Gen Heap. CMS
> measured in the 60 second range (approx 2k writes per second per node)
> * 10 threads - Nodes become unresponsive due to full Old Gen Heap. CMS
> measured in the 60 second range
> * 50 threads - Nodes become unresponsive due to full Old Gen Heap. CMS
> measured in the 60 second range (approx 10k writes per second per node)
> * 100 threads - Nodes become unresponsive due to full Old Gen Heap. CMS
> measured in the 60 second range (approx 20k writes per second per node)
> * 200 threads - Nodes become unresponsive due to full Old Gen Heap. CMS
> measured in the 60 second range (approx 25k writes per second per node)
> Note - the observed behavior was the same for all tests except the single
> threaded test, which does not appear to trigger this behavior.
> Tested different GC and Linux OS settings with a focus on the 50 and 200
> thread loads.
> JVM settings tested:
> # default, out of the box, env-sh settings
> # 10 G Max | 1 G New - default env-sh settings
> # 10 G Max | 1 G New - default env-sh settings, plus:
> #* JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50"
> # 20 G Max | 10 G New
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
> # 20 G Max | 1 G New
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
> Linux OS settings tested:
> # Disabled Transparent Huge Pages
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/defrag
> # Enabled Huge Pages (sizing arithmetic checked after this list)
> echo 21500000000 > /proc/sys/kernel/shmmax (over 20GB for heap)
> echo 1536 > /proc/sys/vm/nr_hugepages (20GB/2MB page size)
> # Disabled NUMA
> numa=off in /etc/grub.conf
> # Verified all settings documented here were implemented
>
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
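> As a sanity check on the huge page sizing above (back-of-the-envelope only,
> not from the ticket): with 2 MB pages, a 20 GB heap needs 20480 MB / 2 MB =
> 10240 pages, while the 1536 pages set above cover 1536 * 2 MB = 3 GB.
> # Illustrative sizing calculation (bash)
> echo $(( 20 * 1024 / 2 )) (prints 10240, the page count for a 20 GB heap)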
> Attachments:
> # cassandra.yaml
> # fio output - results.tar.gz
> # 50 thread heap dump -
> https://drive.google.com/a/datastax.com/file/d/0B4Imdpu2YrEbMGpCZW5ta2liQ2c/view?usp=sharing
> # 100 thread - VisualVM screenshot - visualvm_screenshot
> # dstat screenshot with compaction - Node_with_compaction.png
> # dstat screenshot without compaction - Node_without_compaction.png
> # gcinspector messages from system.log - gcinspector_messages.txt
> # gc.log output - gc.logs.tar.gz
> Observations:
> # even though this is a spinning disk implementation, disk I/O looks good.
> #* output from jshook's perfscripts monitor https://github.com/jshook/perfscripts is
> attached
> #* note, we leveraged direct I/O for all tests by adding direct=1 to the
> .global config files
> # CPU usage is moderate until large GC events occur
> # once the Old Gen heap fills up and cannot be cleaned, memtable post flushers
> start to back up (many pending) via nodetool tpstats (commands sketched after
> this list)
> # the node itself, i.e. ssh, is still responsive, but the Cassandra instance
> becomes unresponsive
> # once the Old Gen heap fills up, Cassandra stress starts to throw CL ONE
> errors stating there aren't enough replicas to satisfy....
> # heap dump from 50 thread, JVM scenario 1 is attached
> #* appears to show a compaction thread consuming a lot of memory
> # sample system.log output for gc issues
> # strace -e futex -p $PID -f -c output during the 100 thread load, captured
> while Old Gen was "filling", just before full:
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- -------
> 100.00  244.886766        4992     49052      7507 futex
> 100.00  244.886766                 49052      7507 total
> # htop during full gc cycle -
> https://s3.amazonaws.com/uploads.hipchat.com/6528/480117/4ZlgcoNScb6kRM2/upload.png
> # nothing is blocked via tpstats on these nodes
> # compaction does have pending tasks, upwards of 20, on the nodes
> # Nodes without compaction achieved approximately 20k writes per second per
> node without errors or drops
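> For reference, the checks behind the thread pool and compaction observations
> above (standard nodetool subcommands; output shapes vary by version):
> # Thread pool stats - look for pending MemtablePostFlusher tasks
> nodetool tpstats
> # Pending compactions (upwards of 20 were observed here)
> nodetool compactionstats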
> Next Steps:
> # Will try to create a flame graph and upload it here (workflow sketched
> below) - http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> # Will try to recreate in another environment
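> For reference, the flame graph workflow from the linked post looks roughly
> like this (a sketch: the clone path and 30 second window are illustrative,
> and Java frames additionally need perf-map-agent to resolve JIT'd symbols):
> # Grab the FlameGraph scripts
> git clone https://github.com/brendangregg/FlameGraph
> # Sample all CPUs at 99 Hz for 30 seconds, capturing call stacks
> sudo perf record -F 99 -a -g -- sleep 30
> # Fold stacks and render an interactive SVG (cf. the attached output.svg)
> sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > output.svg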
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)