[
https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936377#comment-15936377
]
Mina Naguib edited comment on CASSANDRA-13365 at 3/22/17 2:41 PM:
------------------------------------------------------------------
For the time being, we've mitigated by adding a script on each node that
detects it's entered a GC loop (at least 5 logged "Full GC (Allocation
Failure)" errors in 5 minutes) and force-restarts the node (we don't have the
luxury of asking it for a proper drain and soft shutdown).
I'd appreciate any ideas for further investigation towards a proper fix.
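For illustration, a minimal sketch of the kind of watchdog described above (the GC log path, the polling loop and the systemctl restart are assumptions for the sketch, not our actual script):
{noformat}
#!/usr/bin/env python
# Sketch of the GC-loop watchdog described above (illustrative only).
# Assumed: GC log location and restart via systemctl; adjust to your setup.
import subprocess
import time
from collections import deque

GC_LOG = "/var/log/cassandra/gc.log"   # assumed path
WINDOW = 5 * 60                        # 5-minute sliding window
THRESHOLD = 5                          # 5 Full GC events within the window

events = deque()
with open(GC_LOG) as log:
    log.seek(0, 2)                     # start tailing from the end of the log
    while True:
        line = log.readline()
        if not line:
            time.sleep(1)
            continue
        if "Full GC (Allocation Failure)" not in line:
            continue
        now = time.time()
        events.append(now)
        while events and now - events[0] > WINDOW:
            events.popleft()           # drop events older than the window
        if len(events) >= THRESHOLD:
            # No luxury of a proper drain/soft shutdown; force-restart.
            subprocess.run(["systemctl", "restart", "cassandra"])
            events.clear()
{noformat}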
was (Author: minaguib):
For the time being, we've mitigated by adding a script on each node that
detects it's entered a GC loop (at least 5 logged "Full GC (Allocation
Failure)" errors in 5 minutes) and force-restarts the node.
I'd appreciate any ideas for further investigation towards a proper fix.
> Nodes entering GC loop, does not recover
> ----------------------------------------
>
> Key: CASSANDRA-13365
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13365
> Project: Cassandra
> Issue Type: Bug
> Environment: 34-node cluster over 4 DCs
> Linux CentOS 7.2 x86
> Mix of 64GB/128GB RAM / node
> Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz
> High read volume, low write volume, occasional sstable bulk loading
> Reporter: Mina Naguib
>
> Over the last week we've been observing two related problems affecting our
> Cassandra cluster:
> Problem 1: One to a few nodes per DC enter a GC loop and do not recover
> Checking the heap usage stats shows a sudden jump of 1-3GB. Some nodes
> recover, but some don't and log this:
> {noformat}
> 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure)
> 13G->11G(14G), 29.4127307 secs]
> 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure)
> 13G->12G(14G), 28.1561881 secs]
> 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.7019501 secs]
> 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.1372267 secs]
> 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.0703975 secs]
> 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3187768 secs]
> 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure)
> 13G->13G(14G), 26.9493405 secs]
> 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.5222085 secs]
> 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.1769581 secs]
> 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.4639031 secs]
> 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure)
> 13G->13G(14G), 30.3480523 secs]
> 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.8220513 secs]
> 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.4691271 secs]
> 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.0275733 secs]
> 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.1902627 secs]
> 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.8872850 secs]
> 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.1609706 secs]
> 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3635138 secs]
> 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.4143510 secs]
> 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.8595986 secs]
> 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.5242543 secs]
> 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure)
> 13G->13G(14G), 30.8930130 secs]
> 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.6523013 secs]
> 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3030198 secs]
> 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.1594154 secs]
> 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3766949 secs]
> 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.5756770 secs]
> 2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.5541813 secs]
> 2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure)
> 13G->13G(14G), 33.7510103 secs]
> 2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.4856611 secs]
> 2017-03-21T11:37:44.454-0400: 54981.017: [Full GC (Allocation Failure)
> 13G->13G(14G), 28.1269335 secs]
> 2017-03-21T11:38:12.774-0400: 55009.337: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.7830448 secs]
> 2017-03-21T11:38:40.840-0400: 55037.402: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3527326 secs]
> 2017-03-21T11:39:08.610-0400: 55065.173: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.5828941 secs]
> 2017-03-21T11:39:36.833-0400: 55093.396: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.9303030 secs]
> 2017-03-21T11:40:05.265-0400: 55121.828: [Full GC (Allocation Failure)
> 13G->13G(14G), 36.9902867 secs]
> 2017-03-21T11:40:42.400-0400: 55158.963: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.6835744 secs]
> 2017-03-21T11:41:10.529-0400: 55187.091: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.1899555 secs]
> 2017-03-21T11:41:38.018-0400: 55214.581: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.7309706 secs]
> 2017-03-21T11:42:06.062-0400: 55242.624: [Full GC (Allocation Failure)
> 13G->13G(14G), 29.2103234 secs]
> 2017-03-21T11:42:35.579-0400: 55272.141: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.9902741 secs]
> 2017-03-21T11:43:03.955-0400: 55300.518: [Full GC (Allocation Failure)
> 13G->13G(14G), 28.0764545 secs]
> 2017-03-21T11:43:32.186-0400: 55328.748: [Full GC (Allocation Failure)
> 13G->13G(14G), 28.3913296 secs]
> 2017-03-21T11:44:00.810-0400: 55357.373: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.0945131 secs]
> 2017-03-21T11:44:28.343-0400: 55384.905: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3357763 secs]
> 2017-03-21T11:44:55.909-0400: 55412.471: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.9216626 secs]
> 2017-03-21T11:45:24.070-0400: 55440.632: [Full GC (Allocation Failure)
> 13G->13G(14G), 29.8189953 secs]
> 2017-03-21T11:45:54.092-0400: 55470.654: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.9992085 secs]
> 2017-03-21T11:46:22.219-0400: 55498.782: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.4622100 secs]
> 2017-03-21T11:46:49.983-0400: 55526.545: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.9159143 secs]
> 2017-03-21T11:47:18.029-0400: 55554.592: [Full GC (Allocation Failure)
> 13G->13G(14G), 28.5207919 secs]
> 2017-03-21T11:47:46.938-0400: 55583.500: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.0355291 secs]
> 2017-03-21T11:48:14.235-0400: 55610.798: [Full GC (Allocation Failure)
> 13G->13G(14G), 35.6284865 secs]
> 2017-03-21T11:48:50.013-0400: 55646.575: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3299061 secs]
> 2017-03-21T11:49:17.592-0400: 55674.154: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.2544348 secs]
> {noformat}
> When a node enters the above state, it becomes useless within 1-5 minutes:
> * Client connections die
> * New connections time out
> * CPU usage drops - all threads do nothing except for 1 thread near 100% CPU
> ("VM Thread" os_prio=0 tid=0x00007ffaf8209bf0 nid=0x2d53 runnable), doing
> the above GC in a loop
> * JMX becomes unqueryable
> * Regular kill signal is ignored (needs kill -9)
> I've managed to extract a full heap dump of a node while in this state, but
> it's 20GB and I have not been able to find a tool that can analyze it.
> I've extracted a histogram of heap objects. From a healthy node:
> {noformat}
> num #instances #bytes class name
> ----------------------------------------------
> 1: 12688007 1116544616 io.netty.buffer.PooledUnsafeDirectByteBuf
> 2: 24938940 798046080 io.netty.util.Recycler$DefaultHandle
> 3: 11457548 733283072 io.netty.channel.ChannelOutboundBuffer$Entry
> 4: 9582138 459942624 java.nio.HeapByteBuffer
> 5: 16974 393432416 [I
> 6: 2014719 311356992 [B
> 7: 4982167 181324848 [Ljava.lang.Object;
> 8: 8531 118782704 [Lio.netty.util.Recycler$DefaultHandle;
> 9: 2827662 113106480 org.apache.cassandra.db.rows.BufferCell
> 10: 2737816 87610112 org.apache.cassandra.db.rows.BTreeRow
> 11: 2740110 70228240 [Ljava.nio.ByteBuffer;
> 12: 2738081 65713944 org.apache.cassandra.db.BufferClustering
> 13: 1763956 56446592 com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
> 14: 1763736 56439552 com.googlecode.concurrentlinkedhashmap.ConcurrentHashMapV8$Node
> 15: 2205825 52939800 com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
> 16: 1617026 51744832 java.util.concurrent.ConcurrentHashMap$Node
> 17: 1769934 42478416 org.apache.cassandra.cache.KeyCacheKey
> {noformat}
> From an unhealthy node:
> {noformat}
> num #instances #bytes class name
> ----------------------------------------------
> 1: 103726026 4978849248 java.nio.HeapByteBuffer
> 2: 95646048 2832233872 [B
> 3: 28076465 1207916952 [Ljava.lang.Object;
> 4: 12803448 1126703424 io.netty.buffer.PooledUnsafeDirectByteBuf
> 5: 25054634 801748288 io.netty.util.Recycler$DefaultHandle
> 6: 11457548 733283072 io.netty.channel.ChannelOutboundBuffer$Entry
> 7: 24725380 593409120 java.util.ArrayList
> 8: 5808290 185865280 java.util.RegularEnumSet
> 9: 2870889 160769784 org.apache.cassandra.transport.messages.ResultMessage$Rows
> 10: 2937336 140992128 io.netty.buffer.SlicedAbstractByteBuf
> 11: 8854 118773984 [Lio.netty.util.Recycler$DefaultHandle;
> 12: 2830805 113232200 org.apache.cassandra.db.rows.BufferCell
> 13: 2937336 93994752 org.apache.cassandra.transport.Frame$Header
> 14: 2870928 91869696 org.apache.cassandra.cql3.ResultSet$ResultMetadata
> 15: 2728627 87316064 org.apache.cassandra.db.rows.BTreeRow
> 16: 2947033 70728792 java.util.concurrent.ConcurrentLinkedQueue$Node
> 17: 2937336 70496064 org.apache.cassandra.transport.Frame
> {noformat}
> The increase is clear in these top 3 object types:
> {noformat}
> 1: 103726026 4978849248 java.nio.HeapByteBuffer
> 2: 95646048 2832233872 [B
> 3: 28076465 1207916952 [Ljava.lang.Object;
> {noformat}
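> Histograms like the above are typically produced with jmap -histo <pid>. For
> illustration, a small sketch (file names hypothetical) that diffs a healthy
> and an unhealthy dump and surfaces the classes that grew the most:
> {noformat}
> # Illustrative sketch: diff two jmap-style histograms and print the classes
> # with the largest growth in bytes. File names are hypothetical.
> def parse_histo(path):
>     sizes = {}
>     for line in open(path):
>         fields = line.split()
>         # data rows look like: "1: 12688007 1116544616 io.netty.buffer...."
>         if len(fields) >= 4 and fields[0].rstrip(":").isdigit():
>             sizes[fields[3]] = int(fields[2])
>     return sizes
>
> healthy = parse_histo("histo-healthy.txt")
> unhealthy = parse_histo("histo-unhealthy.txt")
>
> growth = {cls: unhealthy.get(cls, 0) - healthy.get(cls, 0)
>           for cls in set(healthy) | set(unhealthy)}
>
> for cls, delta in sorted(growth.items(), key=lambda kv: kv[1], reverse=True)[:10]:
>     print("%15d  %s" % (delta, cls))
> {noformat}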
> Other related observations:
> Client socket states: Essentially the JVM has stopped working, and the
> recv-queue is piling up kernel-side but not being consumed app-side. Worth
> noting that the peers (clients) have given up and closed the connections on
> their end, but they still appear ESTABlished on Cassandra's side since it
> hasn't drained the read buffer to EOF/error or completed the closure:
> {noformat}
> ESTAB 558612 0 172.30.72.133:9042 172.30.72.145:11418
> ESTAB 564172 0 172.30.72.133:9042 172.30.72.145:12420
> ESTAB 545284 0 172.30.72.133:9042 172.30.72.145:18212
> ESTAB 565468 0 172.30.72.133:9042 172.30.72.145:18346
> ESTAB 525884 0 172.30.72.133:9042 172.30.72.145:18352
> ESTAB 532324 0 172.30.72.133:9042 172.30.72.145:28896
> ESTAB 517592 0 172.30.72.133:9042 172.30.72.145:34504
> ESTAB 553392 0 172.30.72.133:9042 172.30.72.145:36926
> ESTAB 581952 0 172.30.72.133:9042 172.30.72.145:40684
> ESTAB 543668 0 172.30.72.133:9042 172.30.72.145:42282
> ESTAB 511860 0 172.30.72.133:9042 172.30.72.145:47350
> ESTAB 590380 0 172.30.72.133:9042 172.30.72.145:47566
> ESTAB 583120 0 172.30.72.133:9042 172.30.72.145:50386
> ESTAB 536772 0 172.30.72.133:9042 172.30.72.145:52656
> ESTAB 31 0 172.30.72.133:9042 172.30.72.145:52946
> ESTAB 532140 0 172.30.72.133:9042 172.30.72.145:53618
> ESTAB 14397 0 172.30.72.133:9042 172.30.72.150:10016
> ESTAB 6444 0 172.30.72.133:9042 172.30.72.150:12176
> ESTAB 7568 0 172.30.72.133:9042 172.30.72.150:12934
> ESTAB 139479 0 172.30.72.133:9042 172.30.72.150:13186
> ESTAB 6896 0 172.30.72.133:9042 172.30.72.150:17316
> ESTAB 7808 0 172.30.72.133:9042 172.30.72.150:17448
> ESTAB 8712 0 172.30.72.133:9042 172.30.72.150:17748
> ESTAB 6944 0 172.30.72.133:9042 172.30.72.150:18472
> ESTAB 136700 0 172.30.72.133:9042 172.30.72.150:19860
> ESTAB 16260 0 172.30.72.133:9042 172.30.72.150:20696
> ESTAB 8943 0 172.30.72.133:9042 172.30.72.150:21208
> ESTAB 132877 0 172.30.72.133:9042 172.30.72.150:21420
> ESTAB 6936 0 172.30.72.133:9042 172.30.72.150:25440
> ESTAB 92 0 172.30.72.133:9042 172.30.72.150:25778
> ESTAB 137312 0 172.30.72.133:9042 172.30.72.150:28258
> ESTAB 40 0 172.30.72.133:9042 172.30.72.150:28348
> ESTAB 6940 0 172.30.72.133:9042 172.30.72.150:29718
> ESTAB 6416 0 172.30.72.133:9042 172.30.72.150:29858
> ESTAB 6580 0 172.30.72.133:9042 172.30.72.150:30518
> ESTAB 4202 0 172.30.72.133:9042 172.30.72.150:30948
> ESTAB 6060 0 172.30.72.133:9042 172.30.72.150:35276
> ESTAB 0 0 172.30.72.133:9042 172.30.72.150:40114
> ESTAB 134303 0 172.30.72.133:9042 172.30.72.150:42212
> ESTAB 396 0 172.30.72.133:9042 172.30.72.150:44848
> ESTAB 136450 0 172.30.72.133:9042 172.30.72.150:47742
> ESTAB 8318 0 172.30.72.133:9042 172.30.72.150:51054
> ESTAB 11976 0 172.30.72.133:9042 172.30.72.150:54560
> ESTAB 136193 0 172.30.72.133:9042 172.30.72.150:56094
> ESTAB 138737 0 172.30.72.133:9042 172.30.72.150:60382
> ESTAB 7996 0 172.30.72.133:9042 172.30.72.150:60932
> ESTAB 137235 0 172.30.72.133:9042 172.30.72.150:61568
> ESTAB 8184 0 172.30.72.133:9042 172.30.72.150:63054
> ESTAB 985276 0 172.30.72.133:9042 172.30.72.175:23494
> ESTAB 601528 0 172.30.72.133:9042 172.30.72.175:24140
> ESTAB 557844 0 172.30.72.133:9042 172.30.72.175:42248
> ESTAB 469344 0 172.30.72.133:9042 172.30.72.175:59704
> ESTAB 507160 0 172.30.72.133:9042 172.30.72.185:11958
> ESTAB 449196 0 172.30.72.133:9042 172.30.72.185:15458
> ESTAB 487796 0 172.30.72.133:9042 172.30.72.185:16458
> ESTAB 441132 0 172.30.72.133:9042 172.30.72.185:23194
> ESTAB 486664 0 172.30.72.133:9042 172.30.72.185:25276
> ESTAB 554780 0 172.30.72.133:9042 172.30.72.185:29102
> ESTAB 496248 0 172.30.72.133:9042 172.30.72.185:29428
> ESTAB 420048 0 172.30.72.133:9042 172.30.72.185:39650
> ESTAB 498704 0 172.30.72.133:9042 172.30.72.185:46424
> ESTAB 506488 0 172.30.72.133:9042 172.30.72.185:49722
> ESTAB 509416 0 172.30.72.133:9042 172.30.72.185:55838
> ESTAB 511956 0 172.30.72.133:9042 172.30.72.185:57670
> ESTAB 496720 0 172.30.72.133:9042 172.30.72.185:59884
> {noformat}
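> For illustration, a small sketch (arbitrary threshold) of how such stalled
> clients could be flagged automatically, assuming ss -tn output in the format
> shown above:
> {noformat}
> # Illustrative sketch: parse `ss -tn` output and flag established client
> # connections to port 9042 whose kernel receive queue is backing up.
> import subprocess
>
> THRESHOLD = 64 * 1024  # bytes unread; arbitrary for illustration
>
> out = subprocess.run(["ss", "-tn"], capture_output=True, text=True).stdout
> for line in out.splitlines()[1:]:          # skip the header row
>     fields = line.split()
>     if len(fields) < 5 or fields[0] != "ESTAB":
>         continue
>     recv_q, local, peer = int(fields[1]), fields[3], fields[4]
>     if local.endswith(":9042") and recv_q > THRESHOLD:
>         print("stalled client %s, %d bytes unread" % (peer, recv_q))
> {noformat}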
> perf top during this state:
> {noformat}
> 19.73% libjvm.so [.] InstanceKlass::oop_follow_contents
> 14.05% libjvm.so [.] ObjArrayKlass::oop_follow_contents
> 7.83% libjvm.so [.] MarkSweep::follow_stack
> 6.38% libjvm.so [.] HeapRegion::block_size
> 5.41% libjvm.so [.] OtherRegionsTable::add_reference
> 5.12% libjvm.so [.] UpdateRSOopClosure::do_oop
> 4.77% libjvm.so [.] HeapRegion::block_is_obj
> {noformat}
> Frequency: Affects nodes in our 2 busier DCs (the other 2 DCs serve less
> volume), roughly 2-10 times per day. Not always the same nodes (though lately
> 2 hiccup more frequently than others).
> Tried:
> * Upgrading cassandra from 3.9 to 3.10
> * Increasing heap size from 8GB to 14GB
> * Upgrading JVM 1.8 from u112 to u121
> * Switching from CMS to G1GC
> The last change, switching from CMS to G1GC, reduced the frequency of the
> problem but has not eliminated it.
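> For reference, the heap and collector changes above boil down to JVM flags
> along these lines (illustrative excerpt only):
> {noformat}
> -Xms14G
> -Xmx14G
> -XX:+UseG1GC
> {noformat}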
> Problem 2: Read volume in DC depressed
> With our RF set to 2 per DC, if 2 or more nodes in a DC enter the GC loop
> described in problem 1, all nodes in the same DC accumulate very high
> active/pending/blocked Task-Transport queues.
> Total read volume for the DC drops to 10%-30% of normal.
> Tried:
> * Increasing -Dcassandra.max_queued_native_transport_requests to 4096
> * Reducing read_request_timeout_in_ms to 400
> * Reducing write_request_timeout_in_ms to 800
> These improved things slightly, but not significantly.
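> For reference, those changes correspond to settings along these lines
> (illustrative excerpt only):
> {noformat}
> # cassandra.yaml
> read_request_timeout_in_ms: 400
> write_request_timeout_in_ms: 800
>
> # JVM system property
> -Dcassandra.max_queued_native_transport_requests=4096
> {noformat}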
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)