[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242183#comment-14242183 ]
Philo Yang commented on CASSANDRA-8447:
---------------------------------------

I have the same problem: a full GC does not reduce the size of the old generation. A few days ago I posted this to the mailing list, and people suggested it could be solved by tuning the GC settings, but that did not work for me. After seeing this issue I suspect it may be a bug. I can offer some information from my cluster in the hope that it helps you find the bug, if it is one. Of course, the cause of my problem may be different from [~jlacefie]'s.

The unresponsiveness only appears on some nodes: those nodes become unresponsive several times a day, while the other nodes never do. When there is no trouble, the load on all nodes is the same, so I do not think the unresponsiveness is caused by heavy load.

Before a node becomes unresponsive, it is easy to see with jstat that the old gen stays very large after a CMS collection (normally it drops below 1 GB after a full GC, but when the trouble starts it is still more than 4 GB after CMS GC). And there is often a compaction stuck at 99% or even 99.99% for many minutes, like this:

pending tasks: 1
   compaction type   keyspace    table     completed       total   unit   progress
        Compaction   keyspace    table     354680703   354710642   bytes     99.99%

However, I am not sure the trouble always comes with a stuck compaction, because I have not followed every unresponsive episode.

I used jmap to print the objects held on the heap; I do not know whether it is helpful to you:

 num     #instances         #bytes  class name
----------------------------------------------
   1:      11899268     3402792016  [B
   2:      23734819     1139271312  java.nio.HeapByteBuffer
   3:      11140273      306165600  [Ljava.nio.ByteBuffer;
   4:       9484838      227636112  org.apache.cassandra.db.composites.CompoundComposite
   5:       8220604      197294496  org.apache.cassandra.db.composites.BoundedComposite
   6:         27187       69131928  [J
   7:       1673344       53547008  org.apache.cassandra.db.composites.CompoundSparseCellName
   8:       1540101       49283232  org.apache.cassandra.db.BufferCell
   9:          3158       45471360  [Lorg.apache.cassandra.db.composites.Composite;
  10:          2527       27865040  [I
  11:        251797       20236456  [Ljava.lang.Object;
  12:        417752       12899976  [C
  13:        263201       10528040  org.apache.cassandra.db.BufferExpiringCell
  14:        322324       10314368  com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
  15:        322237       10311584  com.googlecode.concurrentlinkedhashmap.ConcurrentHashMapV8$Node
  16:        417331       10015944  java.lang.String
  17:         86368        8891280  [Lorg.apache.cassandra.db.Cell;
  18:        349917        8398008  org.apache.cassandra.cql3.ColumnIdentifier
  19:        204161        8166440  java.util.TreeMap$Entry
  20:        322324        7735776  com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
  21:        322324        7735776  org.apache.cassandra.cache.KeyCacheKey
  22:        317274        7614576  java.lang.Double
  23:        317154        7611696  org.apache.cassandra.db.RowIndexEntry
  24:        314642        7551408  java.util.concurrent.ConcurrentSkipListMap$Node
  25:         52560        7316584  <constMethodKlass>
  26:        292136        7011264  java.lang.Long
  27:         52560        6740064  <methodKlass>
  28:          5290        5937512  <constantPoolKlass>
  29:        160281        3846744  org.apache.cassandra.db.BufferDecoratedKey
  30:        155777        3738648  java.util.concurrent.ConcurrentSkipListMap$Index
  31:          5290        3642232  <instanceKlassKlass>
  32:        150284        3606816  org.apache.cassandra.db.AtomicBTreeColumns
  33:        150261        3606264  org.apache.cassandra.db.AtomicBTreeColumns$Holder
  34:         87861        3514440  org.apache.cassandra.db.ArrayBackedSortedColumns
  35:         87768        3510720  org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask
  36:          4634        3372096  <constantPoolCacheKlass>
  37:         84262        3370480  java.util.Collections$SingletonMap
  38:          6243        2778728  <methodDataKlass>
  39:        173490        2775840  org.apache.cassandra.dht.LongToken
  40:         82000        2624000  java.util.RegularEnumSet
  41:         81981        2623392  org.apache.cassandra.net.MessageIn
  42:         81980        2623360  org.apache.cassandra.net.MessageDeliveryTask
  43:        102511        2460264  java.util.concurrent.ConcurrentLinkedQueue$Node
  44:         94901        2277624  org.apache.cassandra.db.DeletionInfo
  45:         93837        2252088  java.util.concurrent.Executors$RunnableAdapter
  46:        140525        2248400  org.apache.cassandra.db.composites.SimpleDenseCellName
  47:         87753        2106072  org.apache.cassandra.db.Mutation
  48:             1        2097168  [Lcom.googlecode.concurrentlinkedhashmap.ConcurrentHashMapV8$Node;
  49:         82130        1971120  java.net.InetAddress$InetAddressHolder
  50:         82127        1971048  java.net.Inet4Address

I also took a jstack of the JVM; there is no deadlock and I cannot find any problem in it. I posted it to https://gist.github.com/yangzhe1991/755ea2a10520be1fe59a

Hope this helps you solve the issue :)
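For reference, the jstat/jmap/jstack checks described above can be reproduced with the standard JDK tools. A minimal sketch, assuming the Cassandra JVM can be found by its CassandraDaemon process name and sampling once per second:

PID=$(pgrep -f CassandraDaemon)

# Old gen occupancy: with -gcutil, the O column is old gen utilization as a
# percentage of capacity, so a CMS cycle that fails to shrink it is visible directly.
jstat -gcutil $PID 1000

# Class histogram of objects on the heap, as quoted above (top ~50 entries).
jmap -histo $PID | head -60

# Thread dump to rule out deadlocks or stuck compaction threads.
jstack -l $PID > jstack.$(date +%s).txt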
> Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
> --------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8447
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cluster size - 4 nodes
>                      Node size - 12 CPU (hyper-threaded to 24 cores), 192 GB RAM, 2 RAID 0 arrays (Data - 10 disk, spinning 10k drives | CL - 2 disk, spinning 10k drives)
>                      OS - RHEL 6.5
>                      JVM - Oracle 1.7.0_71
>                      Cassandra version 2.0.11
>            Reporter: jonathan lacefield
>         Attachments: Node_with_compaction.png, Node_without_compaction.png, cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, results.tar.gz, visualvm_screenshot
>
> Behavior - If autocompaction is enabled, nodes will become unresponsive due to a full Old Gen heap which is not cleared during CMS GC.
>
> Test methodology - Disabled autocompaction on 3 nodes and left autocompaction enabled on 1 node. Executed different Cassandra stress loads, using write-only operations. Monitored visualvm and jconsole for heap pressure. Captured iostat and dstat for most tests. Captured a heap dump from the 50-thread load. Hints were disabled for testing on all nodes to alleviate GC noise due to hints backing up. (A sketch of this per-node setup follows the results list below.)
>
> Data load test through Cassandra stress - /usr/bin/cassandra-stress write n=1900000000 -rate threads=<different threads tested> -schema replication\(factor=3\) keyspace="Keyspace1" -node <all nodes listed>
>
> Data load thread count and results:
> * 1 thread - Still running, but it looks like the node can sustain this load (approx 500 writes per second per node)
> * 5 threads - Nodes become unresponsive due to full Old Gen heap; CMS measured in the 60-second range (approx 2k writes per second per node)
> * 10 threads - Nodes become unresponsive due to full Old Gen heap; CMS measured in the 60-second range
> * 50 threads - Nodes become unresponsive due to full Old Gen heap; CMS measured in the 60-second range (approx 10k writes per second per node)
> * 100 threads - Nodes become unresponsive due to full Old Gen heap; CMS measured in the 60-second range (approx 20k writes per second per node)
> * 200 threads - Nodes become unresponsive due to full Old Gen heap; CMS measured in the 60-second range (approx 25k writes per second per node)
> Note - the observed behavior was the same for all tests except the single-threaded test, which does not appear to show this behavior.
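For reference, a rough sketch of how the per-node setup and one of the load runs above might look. The host names and the 50-thread count are placeholders, and it assumes nodetool disableautocompaction is available in this release (otherwise the per-table compaction thresholds can be set to 0 with nodetool setcompactionthreshold instead):

# Disable autocompaction on 3 of the 4 nodes; leave it enabled on the fourth.
for host in node1 node2 node3; do
    nodetool -h "$host" disableautocompaction
done

# One of the write-only stress runs from the description, here with 50 threads.
/usr/bin/cassandra-stress write n=1900000000 -rate threads=50 \
    -schema replication\(factor=3\) keyspace="Keyspace1" \
    -node node1,node2,node3,node4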
> Tested different GC and Linux OS settings with a focus on the 50- and 200-thread loads.
>
> JVM settings tested:
> # Default, out-of-the-box cassandra-env.sh settings
> # 10 G Max | 1 G New - default cassandra-env.sh settings
> # 10 G Max | 1 G New - default cassandra-env.sh settings, plus:
> #* JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50"
> # 20 G Max | 10 G New
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
> # 20 G Max | 1 G New
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
>
> Linux OS settings tested:
> # Disabled Transparent Huge Pages:
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/defrag
> # Enabled Huge Pages:
> echo 21500000000 > /proc/sys/kernel/shmmax   (over 20 GB, for the heap)
> echo 1536 > /proc/sys/vm/nr_hugepages   (20 GB / 2 MB page size)
> # Disabled NUMA:
> numa=off in /etc/grub.conf
> # Verified all settings documented here were implemented:
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
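For reference, a quick way to confirm that these OS-level settings actually took effect on each node; the paths are the standard RHEL 6 locations and the expected values are the ones set above:

# Transparent Huge Pages: both files should show [never] selected.
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# Huge pages and shared memory limit as configured above.
cat /proc/sys/vm/nr_hugepages
cat /proc/sys/kernel/shmmax

# NUMA: numa=off should appear on the kernel command line if grub.conf was updated.
grep -o numa=off /proc/cmdline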
> Attachments:
> # .yaml
> # fio output - results.tar.gz
> # 50 thread heap dump - https://drive.google.com/a/datastax.com/file/d/0B4Imdpu2YrEbMGpCZW5ta2liQ2c/view?usp=sharing
> # 100 thread - VisualVM anonymous screenshot - visualvm_screenshot
> # dstat screenshot with compaction - Node_with_compaction.png
> # dstat screenshot without compaction - Node_without_compaction.png
> # gcinspector messages from system.log
> # gc.log output - gc.logs.tar.gz
>
> Observations:
> # Even though this is a spinning-disk implementation, disk I/O looks good.
> #* Output from the jshook perf monitor (https://github.com/jshook/perfscripts) is attached.
> #* Note: we leveraged direct I/O for all tests by adding direct=1 to the .global config files.
> # CPU usage is moderate until large GC events occur.
> # Once the old gen heap fills up and cannot be cleaned, memtable post flushers start to back up (they show a lot of pending tasks in tpstats).
> # The node itself (i.e. ssh) is still responsive, but the Cassandra instance becomes unresponsive.
> # Once the old gen heap fills up, cassandra-stress starts to throw CL ONE errors stating there aren't enough replicas to satisfy....
> # The heap dump from the 50-thread run under JVM scenario 1 is attached.
> #* It appears to show a compaction thread consuming a lot of memory.
> # Sample system.log output for the GC issues is attached.
> # strace -e futex -p $PID -f -c output during the 100-thread load, while old gen was "filling", just before full:
>
> % time     seconds  usecs/call     calls    errors syscall
> 100.00  244.886766        4992     49052      7507 futex
> 100.00  244.886766                 49052      7507 total
>
> # htop during a full GC cycle - https://s3.amazonaws.com/uploads.hipchat.com/6528/480117/4ZlgcoNScb6kRM2/upload.png
> # Nothing is blocked via tpstats on these nodes.
> # Compaction does have pending tasks, upwards of 20, on the nodes.
> # Nodes without compaction achieved approximately 20k writes per second per node without errors or drops.
>
> Next Steps:
> # Will try to create a flame graph and update the load here - http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> # Will try to recreate in another environment

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)