[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242954#comment-14242954 ]

jonathan lacefield edited comment on CASSANDRA-8447 at 12/11/14 6:58 PM:
-------------------------------------------------------------------------

[~benedict]  Interesting about hints.  Just verified hints on the cluster:
*  CQLSH shows a count of 0 in system.hints
*  The local data directory under hints is empty on all nodes.
*  For all "healthy" nodes, tpstats shows no pending/active hint ops.
*  For the "unhealthy" node, tpstats shows 2 Active and 3 Pending hint ops.

From the "unhealthy" node:
cqlsh> use system;
cqlsh:system> select count(*) from hints ;
 count
-------
     0

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0              2         0                 0
RequestResponseStage              0         0              9         0                 0
MutationStage                     0         0       16471703         0                 0
ReadRepairStage                   0         0              0         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0            439         0                 0
CacheCleanupExecutor              0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
MemoryMeter                       0         0             24         0                 0
FlushWriter                       0         0            175         0                 0
ValidationExecutor                0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
AntiEntropyStage                  0         0              0         0                 0
MemtablePostFlusher               0         0            194         0                 0
MiscStage                         0         0              0         0                 0
PendingRangeCalculator            0         0              6         0                 0
CompactionExecutor                1        17             18         0                 0
commitlog_archiver                0         0              0         0                 0
HintedHandoff                     2         3              0         0                 0
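
For reference, the same evidence can be gathered from every node in one pass; a minimal sketch, assuming passwordless ssh, placeholder hostnames, and the default data directory:

    #!/bin/bash
    # Collect hint evidence per node: HintedHandoff pool state, system.hints
    # row count, and whether anything exists in the hints data directory.
    NODES="node1 node2 node3 node4"                    # placeholder hostnames
    for h in $NODES; do
        echo "== $h =="
        ssh "$h" 'nodetool tpstats | grep -E "Pool Name|HintedHandoff"'
        echo "use system; select count(*) from hints;" | ssh "$h" cqlsh
        ssh "$h" 'ls /var/lib/cassandra/data/system/hints 2>/dev/null | wc -l'   # assumed path
    done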

Here is the excerpt of the current hints config items from cassandra.yaml on all 4 nodes:
hinted_handoff_enabled: false
# this defines the maximum amount of time a dead host will have hints
# generated.  After it has been dead this long, new hints for it will not be
# created until it has been seen alive and gone down again.
max_hint_window_in_ms: 10800000 # 3 hours
# Maximum throttle in KBs per second, per delivery thread (reduced
# proportionally to the number of nodes in the cluster, since we expect
# two nodes to be delivering hints simultaneously).
hinted_handoff_throttle_in_kb: 1024
# Number of threads with which to deliver hints
max_hints_delivery_threads: 2
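
To confirm the setting is identical on every node, something like the following can be run; hostnames and the cassandra.yaml path are assumptions:

    # Check hinted_handoff_enabled in each node's cassandra.yaml (path assumed).
    for h in node1 node2 node3 node4; do
        echo -n "$h: "
        ssh "$h" 'grep -E "^hinted_handoff_enabled" /etc/cassandra/cassandra.yaml'
    done
    # Hints can also be switched off at runtime with: nodetool disablehandoff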

(Edited: even after restarting DSE, the "unhealthy" node still shows 2 active and 3 pending hint ops via tpstats.)


> Nodes stuck in CMS GC cycle with very little traffic when compaction is 
> enabled
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8447
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cluster size - 4 nodes
> Node size - 12 CPU (hyper threaded to 24 cores), 192 GB RAM, 2 Raid 0 arrays 
> (Data - 10 disk, spinning 10k drives | CL 2 disk, spinning 10k drives)
> OS - RHEL 6.5
> jvm - oracle 1.7.0_71
> Cassandra version 2.0.11
>            Reporter: jonathan lacefield
>         Attachments: Node_with_compaction.png, Node_without_compaction.png, 
> cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, 
> output.1.svg, output.2.svg, output.svg, results.tar.gz, visualvm_screenshot
>
>
> Behavior - If autocompaction is enabled, nodes will become unresponsive due
> to a full Old Gen heap which is not cleared during CMS GC.
> Test methodology - disabled autocompaction on 3 nodes, left autocompaction
> enabled on 1 node.  Executed different Cassandra stress loads, using
> write-only operations.  Monitored VisualVM and JConsole for heap pressure.
> Captured iostat and dstat for most tests.  Captured a heap dump from the
> 50-thread load.  Hints were disabled for testing on all nodes to alleviate
> GC noise due to hints backing up.
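> (As a sketch of how the per-node toggling could be scripted - hostnames are
> placeholders, the keyspace name is assumed, and it assumes this nodetool
> build exposes disableautocompaction; otherwise the equivalent JMX operation
> applies:)
>    # Disable autocompaction on 3 of the 4 nodes; node4 keeps it enabled.
>    for h in node1 node2 node3; do
>        ssh "$h" 'nodetool disableautocompaction Keyspace1'
>    done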
> Data load test through Cassandra stress:
>   /usr/bin/cassandra-stress write n=1900000000 -rate threads=<different threads tested> -schema replication\(factor=3\) keyspace="Keyspace1" -node <all nodes listed>
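> (A scripted sweep over the tested thread counts might look like the
> following; the node list and log file names are placeholders:)
>    # Run the same write-only stress load once per tested thread count.
>    NODES="node1,node2,node3,node4"
>    for t in 1 5 10 50 100 200; do
>        /usr/bin/cassandra-stress write n=1900000000 -rate threads=$t \
>            -schema replication\(factor=3\) keyspace="Keyspace1" \
>            -node $NODES > stress_${t}threads.log 2>&1
>    done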
> Data load thread count and results:
> * 1 thread - Still running but looks like the node can sustain this load 
> (approx 500 writes per second per node)
> * 5 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
> measured in the 60 second range (approx 2k writes per second per node)
> * 10 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
> measured in the 60 second range
> * 50 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
> measured in the 60 second range  (approx 10k writes per second per node)
> * 100 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
> measured in the 60 second range  (approx 20k writes per second per node)
> * 200 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
> measured in the 60 second range  (approx 25k writes per second per node)
> Note - the observed behavior was the same for all tests except the
> single-threaded test, which does not appear to show this behavior.
> Tested different GC and Linux OS settings with a focus on the 50 and 200 
> thread loads.  
> JVM settings tested:
> #  default, out of the box, env-sh settings
> #  10 G Max | 1 G New - default env-sh settings
> #  10 G Max | 1 G New - default env-sh settings
> #* JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50"
> #   20 G Max | 10 G New 
>    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
>    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
>    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>    JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>    JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>    JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
>    JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
>    JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
>    JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
>    JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>    JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>    JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>    JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
>    JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
> # 20 G Max | 1 G New 
>    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
>    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
>    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>    JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>    JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>    JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
>    JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
>    JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
>    JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
>    JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>    JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>    JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>    JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
>    JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
> Linux OS settings tested:
> # Disabled Transparent Huge Pages
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo never > /sys/kernel/mm/transparent_hugepage/defrag
> # Enabled Huge Pages (the page-count arithmetic is cross-checked after this list)
> echo 21500000000 > /proc/sys/kernel/shmmax (over 20GB for heap)
> echo 1536 > /proc/sys/vm/nr_hugepages (20GB/2MB page size)
> # Disabled NUMA
> numa-off in /etc/grub.conf
> # Verified all settings documented here were implemented
>   
> http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
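> (For reference, the nr_hugepages value can be cross-checked against heap
> size divided by the 2 MB huge page size:)
>    # Pages needed to back a given heap with 2 MB huge pages.
>    HEAP_GB=20
>    echo $(( HEAP_GB * 1024 / 2 ))    # 20 GB / 2 MB = 10240 pages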
> Attachments:
> #  .yaml
> #  fio output - results.tar.gz
> #  50 thread heap dump - 
> https://drive.google.com/a/datastax.com/file/d/0B4Imdpu2YrEbMGpCZW5ta2liQ2c/view?usp=sharing
> #  100 thread - visual vm anonymous screenshot - visualvm_screenshot
> #  dstat screen shot of with compaction - Node_with_compaction.png
> #  dstat screen shot of without compaction -- Node_without_compaction.png
> #  gcinspector messages from system.log
> # gc.log output - gc.logs.tar.gz
> Observations:
> #  even though this is a spinning disk implementation, disk io looks good. 
> #* output from Jshook perf monitor https://github.com/jshook/perfscripts is 
> attached
> #* note, we leveraged direct io for all tests by adding direct=1 to the 
> .global config files
> #  cpu usage is moderate until large GC events occur
> #  once the Old Gen heap fills up and cannot be cleaned, MemtablePostFlusher 
> starts to back up (many pending tasks) in tpstats (a capture loop is 
> sketched after this list)
> #  the node itself, i.e. ssh, is still responsive but the Cassandra instance 
> becomes unresponsive
> # once the Old Gen heap fills up, Cassandra stress starts to throw CL ONE 
> errors stating there aren't enough replicas to satisfy....
> #  heap dump from 50 thread, JVM scenario 1 is attached
> #* appears to show a compaction thread consuming a lot of memory
> #  sample system.log output for gc issues
> #  strace -e futex -p $PID -f -c output during 100 thread load and during old 
> gen "filling", just before full
> % time    seconds  usecs/call    calls    errors syscall
> 100.00  244.886766        4992    49052      7507 futex
> 100.00  244.886766                49052      7507 total
> #  htop during full gc cycle  - 
> https://s3.amazonaws.com/uploads.hipchat.com/6528/480117/4ZlgcoNScb6kRM2/upload.png
> #  nothing is blocked via tpstats on these nodes
> #  compaction does have pending tasks, upwards of 20, on the nodes
> #  Nodes without compaction achieved approximately 20k writes per second per 
> node without errors or drops
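> (The capture loop referenced above - a minimal sketch; the poll interval and
> output file are arbitrary:)
>    # Record thread-pool and compaction backlog every 10s while the heap fills.
>    while true; do
>        { date; nodetool tpstats | grep -E 'Pool Name|MemtablePostFlusher|FlushWriter|HintedHandoff'; \
>          nodetool compactionstats; } >> tpstats_watch.log
>        sleep 10
>    done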
> Next Steps:
> #  Will try to create a flame graph and update load here - 
> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
> #  Will try to recreate in another environment


