[
https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936377#comment-15936377
]
Mina Naguib edited comment on CASSANDRA-13365 at 3/22/17 2:41 PM:
------------------------------------------------------------------
For the time being, we've mitigated by adding a script on each node that
detects it's entered a GC loop (at least 5 logged "Full GC (Allocation
Failure)" errors in 5 minutes) and force-restarts the node (we don't have the
luxury of asking it for a proper drain and soft shutdown).
I'd appreciate any ideas for further investigation towards a proper fix.
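For illustration, a minimal sketch of the kind of watchdog described above (the GC log path, the polling loop and the systemctl restart are assumptions for the sketch, not our actual script):
{noformat}
#!/usr/bin/env python
# Sketch of the GC-loop watchdog described above (illustrative only).
# Assumed: GC log location and restart via systemctl; adjust to your setup.
import subprocess
import time
from collections import deque

GC_LOG = "/var/log/cassandra/gc.log"   # assumed path
WINDOW = 5 * 60                        # 5-minute sliding window
THRESHOLD = 5                          # 5 Full GC events within the window

events = deque()
with open(GC_LOG) as log:
    log.seek(0, 2)                     # start tailing from the end of the log
    while True:
        line = log.readline()
        if not line:
            time.sleep(1)
            continue
        if "Full GC (Allocation Failure)" not in line:
            continue
        now = time.time()
        events.append(now)
        while events and now - events[0] > WINDOW:
            events.popleft()           # drop events older than the window
        if len(events) >= THRESHOLD:
            # No luxury of a proper drain/soft shutdown; force-restart.
            subprocess.run(["systemctl", "restart", "cassandra"])
            events.clear()
{noformat}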
was (Author: minaguib):
For the time being, we've mitigated by adding a script on each node that
detects it's entered a GC loop (at least 5 logged "Full GC (Allocation
Failure)" errors in 5 minutes) and force-restarts the node.
I'd appreciate any ideas for further investigation towards a proper fix.
> Nodes entering GC loop, does not recover
> ----------------------------------------
>
> Key: CASSANDRA-13365
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13365
> Project: Cassandra
> Issue Type: Bug
> Environment: 34-node cluster over 4 DCs
> Linux CentOS 7.2 x86
> Mix of 64GB/128GB RAM / node
> Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz
> High read volume, low write volume, occasional sstable bulk loading
> Reporter: Mina Naguib
>
> Over the last week we've been observing two related problems affecting our
> Cassandra cluster:
> Problem 1: One to a few nodes per DC enter a GC loop and do not recover
> Checking the heap usage stats shows a sudden jump of 1-3GB. Some nodes
> recover, but some don't and log this:
> {noformat}
> 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure)
> 13G->11G(14G), 29.4127307 secs]
> 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure)
> 13G->12G(14G), 28.1561881 secs]
> 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.7019501 secs]
> 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.1372267 secs]
> 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.0703975 secs]
> 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3187768 secs]
> 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure)
> 13G->13G(14G), 26.9493405 secs]
> 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.5222085 secs]
> 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.1769581 secs]
> 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.4639031 secs]
> 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure)
> 13G->13G(14G), 30.3480523 secs]
> 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.8220513 secs]
> 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.4691271 secs]
> 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.0275733 secs]
> 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.1902627 secs]
> 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.8872850 secs]
> 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.1609706 secs]
> 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3635138 secs]
> 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.4143510 secs]
> 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.8595986 secs]
> 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.5242543 secs]
> 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure)
> 13G->13G(14G), 30.8930130 secs]
> 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.6523013 secs]
> 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3030198 secs]
> 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.1594154 secs]
> 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3766949 secs]
> 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.5756770 secs]
> 2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.5541813 secs]
> 2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure)
> 13G->13G(14G), 33.7510103 secs]
> 2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.4856611 secs]
> 2017-03-21T11:37:44.454-0400: 54981.017: [Full GC (Allocation Failure)
> 13G->13G(14G), 28.1269335 secs]
> 2017-03-21T11:38:12.774-0400: 55009.337: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.7830448 secs]
> 2017-03-21T11:38:40.840-0400: 55037.402: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3527326 secs]
> 2017-03-21T11:39:08.610-0400: 55065.173: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.5828941 secs]
> 2017-03-21T11:39:36.833-0400: 55093.396: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.9303030 secs]
> 2017-03-21T11:40:05.265-0400: 55121.828: [Full GC (Allocation Failure)
> 13G->13G(14G), 36.9902867 secs]
> 2017-03-21T11:40:42.400-0400: 55158.963: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.6835744 secs]
> 2017-03-21T11:41:10.529-0400: 55187.091: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.1899555 secs]
> 2017-03-21T11:41:38.018-0400: 55214.581: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.7309706 secs]
> 2017-03-21T11:42:06.062-0400: 55242.624: [Full GC (Allocation Failure)
> 13G->13G(14G), 29.2103234 secs]
> 2017-03-21T11:42:35.579-0400: 55272.141: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.9902741 secs]
> 2017-03-21T11:43:03.955-0400: 55300.518: [Full GC (Allocation Failure)
> 13G->13G(14G), 28.0764545 secs]
> 2017-03-21T11:43:32.186-0400: 55328.748: [Full GC (Allocation Failure)
> 13G->13G(14G), 28.3913296 secs]
> 2017-03-21T11:44:00.810-0400: 55357.373: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.0945131 secs]
> 2017-03-21T11:44:28.343-0400: 55384.905: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3357763 secs]
> 2017-03-21T11:44:55.909-0400: 55412.471: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.9216626 secs]
> 2017-03-21T11:45:24.070-0400: 55440.632: [Full GC (Allocation Failure)
> 13G->13G(14G), 29.8189953 secs]
> 2017-03-21T11:45:54.092-0400: 55470.654: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.9992085 secs]
> 2017-03-21T11:46:22.219-0400: 55498.782: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.4622100 secs]
> 2017-03-21T11:46:49.983-0400: 55526.545: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.9159143 secs]
> 2017-03-21T11:47:18.029-0400: 55554.592: [Full GC (Allocation Failure)
> 13G->13G(14G), 28.5207919 secs]
> 2017-03-21T11:47:46.938-0400: 55583.500: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.0355291 secs]
> 2017-03-21T11:48:14.235-0400: 55610.798: [Full GC (Allocation Failure)
> 13G->13G(14G), 35.6284865 secs]
> 2017-03-21T11:48:50.013-0400: 55646.575: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.3299061 secs]
> 2017-03-21T11:49:17.592-0400: 55674.154: [Full GC (Allocation Failure)
> 13G->13G(14G), 27.2544348 secs]
> {noformat}
> When a node enters the above state, it becomes useless within 1-5 minutes:
> * Client connections die
> * New connections time out
> * CPU usage drops - all threads do nothing except for 1 thread near 100% CPU
> ("VM Thread" os_prio=0 tid=0x00007ffaf8209bf0 nid=0x2d53 runnable), doing
> the above GC in a loop
> * JMX becomes unqueryable
> * Regular kill signal is ignored (needs kill -9)
> I've managed to extract a full heap dump of a node while in this state, but
> it's 20GB and I have not been able to find a tool that can analyze it.
> I've extracted a histogram of heap objects. From a healthy node:
> {noformat}
> num #instances #bytes class name
> ----------------------------------------------
> 1: 12688007 1116544616 io.netty.buffer.PooledUnsafeDirectByteBuf
> 2: 24938940 798046080 io.netty.util.Recycler$DefaultHandle
> 3: 11457548 733283072 io.netty.channel.ChannelOutboundBuffer$Entry
> 4: 9582138 459942624 java.nio.HeapByteBuffer
> 5: 16974 393432416 [I
> 6: 2014719 311356992 [B
> 7: 4982167 181324848 [Ljava.lang.Object;
> 8: 8531 118782704 [Lio.netty.util.Recycler$DefaultHandle;
> 9: 2827662 113106480 org.apache.cassandra.db.rows.BufferCell
> 10: 2737816 87610112 org.apache.cassandra.db.rows.BTreeRow
> 11: 2740110 70228240 [Ljava.nio.ByteBuffer;
> 12: 2738081 65713944 org.apache.cassandra.db.BufferClustering
> 13: 1763956 56446592 com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
> 14: 1763736 56439552 com.googlecode.concurrentlinkedhashmap.ConcurrentHashMapV8$Node
> 15: 2205825 52939800 com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
> 16: 1617026 51744832 java.util.concurrent.ConcurrentHashMap$Node
> 17: 1769934 42478416 org.apache.cassandra.cache.KeyCacheKey
> {noformat}
> From an unhealthy node:
> {noformat}
> num #instances #bytes class name
> ----------------------------------------------
> 1: 103726026 4978849248 java.nio.HeapByteBuffer
> 2: 95646048 2832233872 [B
> 3: 28076465 1207916952 [Ljava.lang.Object;
> 4: 12803448 1126703424 io.netty.buffer.PooledUnsafeDirectByteBuf
> 5: 25054634 801748288 io.netty.util.Recycler$DefaultHandle
> 6: 11457548 733283072 io.netty.channel.ChannelOutboundBuffer$Entry
> 7: 24725380 593409120 java.util.ArrayList
> 8: 5808290 185865280 java.util.RegularEnumSet
> 9: 2870889 160769784 org.apache.cassandra.transport.messages.ResultMessage$Rows
> 10: 2937336 140992128 io.netty.buffer.SlicedAbstractByteBuf
> 11: 8854 118773984 [Lio.netty.util.Recycler$DefaultHandle;
> 12: 2830805 113232200 org.apache.cassandra.db.rows.BufferCell
> 13: 2937336 93994752 org.apache.cassandra.transport.Frame$Header
> 14: 2870928 91869696 org.apache.cassandra.cql3.ResultSet$ResultMetadata
> 15: 2728627 87316064 org.apache.cassandra.db.rows.BTreeRow
> 16: 2947033 70728792 java.util.concurrent.ConcurrentLinkedQueue$Node
> 17: 2937336 70496064 org.apache.cassandra.transport.Frame
> {noformat}
> The increase is clear in these top 3 object types:
> {noformat}
> 1: 103726026 4978849248 java.nio.HeapByteBuffer
> 2: 95646048 2832233872 [B
> 3: 28076465 1207916952 [Ljava.lang.Object;
> {noformat}
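> Histograms like the above are typically produced with jmap -histo <pid>. For
> illustration, a small sketch (file names hypothetical) that diffs a healthy
> and an unhealthy dump and surfaces the classes that grew the most:
> {noformat}
> # Illustrative sketch: diff two jmap-style histograms and print the classes
> # with the largest growth in bytes. File names are hypothetical.
> def parse_histo(path):
>     sizes = {}
>     for line in open(path):
>         fields = line.split()
>         # data rows look like: "1: 12688007 1116544616 io.netty.buffer...."
>         if len(fields) >= 4 and fields[0].rstrip(":").isdigit():
>             sizes[fields[3]] = int(fields[2])
>     return sizes
>
> healthy = parse_histo("histo-healthy.txt")
> unhealthy = parse_histo("histo-unhealthy.txt")
>
> growth = {cls: unhealthy.get(cls, 0) - healthy.get(cls, 0)
>           for cls in set(healthy) | set(unhealthy)}
>
> for cls, delta in sorted(growth.items(), key=lambda kv: kv[1], reverse=True)[:10]:
>     print("%15d  %s" % (delta, cls))
> {noformat}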
> Other related observations:
> Client socket states: Essentially the JVM has stopped working, and the
> recv-queue is piling up kernel-side but not being consumed app-side. Worth
> noting that the peers (clients) have given up and closed the connections on
> their end, but they still appear ESTABlished on Cassandra's side since it
> hasn't drained the read buffer to EOF/error or completed the closure:
> {noformat}
> ESTAB 558612 0 172.30.72.133:9042 172.30.72.145:11418
> ESTAB 564172 0 172.30.72.133:9042 172.30.72.145:12420
> ESTAB 545284 0 172.30.72.133:9042 172.30.72.145:18212
> ESTAB 565468 0 172.30.72.133:9042 172.30.72.145:18346
> ESTAB 525884 0 172.30.72.133:9042 172.30.72.145:18352
> ESTAB 532324 0 172.30.72.133:9042 172.30.72.145:28896
> ESTAB 517592 0 172.30.72.133:9042 172.30.72.145:34504
> ESTAB 553392 0 172.30.72.133:9042 172.30.72.145:36926
> ESTAB 581952 0 172.30.72.133:9042 172.30.72.145:40684
> ESTAB 543668 0 172.30.72.133:9042 172.30.72.145:42282
> ESTAB 511860 0 172.30.72.133:9042 172.30.72.145:47350
> ESTAB 590380 0 172.30.72.133:9042 172.30.72.145:47566
> ESTAB 583120 0 172.30.72.133:9042 172.30.72.145:50386
> ESTAB 536772 0 172.30.72.133:9042 172.30.72.145:52656
> ESTAB 31 0 172.30.72.133:9042 172.30.72.145:52946
> ESTAB 532140 0 172.30.72.133:9042 172.30.72.145:53618
> ESTAB 14397 0 172.30.72.133:9042 172.30.72.150:10016
> ESTAB 6444 0 172.30.72.133:9042 172.30.72.150:12176
> ESTAB 7568 0 172.30.72.133:9042 172.30.72.150:12934
> ESTAB 139479 0 172.30.72.133:9042 172.30.72.150:13186
> ESTAB 6896 0 172.30.72.133:9042 172.30.72.150:17316
> ESTAB 7808 0 172.30.72.133:9042 172.30.72.150:17448
> ESTAB 8712 0 172.30.72.133:9042 172.30.72.150:17748
> ESTAB 6944 0 172.30.72.133:9042 172.30.72.150:18472
> ESTAB 136700 0 172.30.72.133:9042 172.30.72.150:19860
> ESTAB 16260 0 172.30.72.133:9042 172.30.72.150:20696
> ESTAB 8943 0 172.30.72.133:9042 172.30.72.150:21208
> ESTAB 132877 0 172.30.72.133:9042 172.30.72.150:21420
> ESTAB 6936 0 172.30.72.133:9042 172.30.72.150:25440
> ESTAB 92 0 172.30.72.133:9042 172.30.72.150:25778
> ESTAB 137312 0 172.30.72.133:9042 172.30.72.150:28258
> ESTAB 40 0 172.30.72.133:9042 172.30.72.150:28348
> ESTAB 6940 0 172.30.72.133:9042 172.30.72.150:29718
> ESTAB 6416 0 172.30.72.133:9042 172.30.72.150:29858
> ESTAB 6580 0 172.30.72.133:9042 172.30.72.150:30518
> ESTAB 4202 0 172.30.72.133:9042 172.30.72.150:30948
> ESTAB 6060 0 172.30.72.133:9042 172.30.72.150:35276
> ESTAB 0 0 172.30.72.133:9042 172.30.72.150:40114
> ESTAB 134303 0 172.30.72.133:9042 172.30.72.150:42212
> ESTAB 396 0 172.30.72.133:9042 172.30.72.150:44848
> ESTAB 136450 0 172.30.72.133:9042 172.30.72.150:47742
> ESTAB 8318 0 172.30.72.133:9042 172.30.72.150:51054
> ESTAB 11976 0 172.30.72.133:9042 172.30.72.150:54560
> ESTAB 136193 0 172.30.72.133:9042 172.30.72.150:56094
> ESTAB 138737 0 172.30.72.133:9042 172.30.72.150:60382
> ESTAB 7996 0 172.30.72.133:9042 172.30.72.150:60932
> ESTAB 137235 0 172.30.72.133:9042 172.30.72.150:61568
> ESTAB 8184 0 172.30.72.133:9042 172.30.72.150:63054
> ESTAB 985276 0 172.30.72.133:9042 172.30.72.175:23494
> ESTAB 601528 0 172.30.72.133:9042 172.30.72.175:24140
> ESTAB 557844 0 172.30.72.133:9042 172.30.72.175:42248
> ESTAB 469344 0 172.30.72.133:9042 172.30.72.175:59704
> ESTAB 507160 0 172.30.72.133:9042 172.30.72.185:11958
> ESTAB 449196 0 172.30.72.133:9042 172.30.72.185:15458
> ESTAB 487796 0 172.30.72.133:9042 172.30.72.185:16458
> ESTAB 441132 0 172.30.72.133:9042 172.30.72.185:23194
> ESTAB 486664 0 172.30.72.133:9042 172.30.72.185:25276
> ESTAB 554780 0 172.30.72.133:9042 172.30.72.185:29102
> ESTAB 496248 0 172.30.72.133:9042 172.30.72.185:29428
> ESTAB 420048 0 172.30.72.133:9042 172.30.72.185:39650
> ESTAB 498704 0 172.30.72.133:9042 172.30.72.185:46424
> ESTAB 506488 0 172.30.72.133:9042 172.30.72.185:49722
> ESTAB 509416 0 172.30.72.133:9042 172.30.72.185:55838
> ESTAB 511956 0 172.30.72.133:9042 172.30.72.185:57670
> ESTAB 496720 0 172.30.72.133:9042 172.30.72.185:59884
> {noformat}
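> For illustration, a small sketch (arbitrary threshold) of how such stalled
> clients could be flagged automatically, assuming ss -tn output in the format
> shown above:
> {noformat}
> # Illustrative sketch: parse `ss -tn` output and flag established client
> # connections to port 9042 whose kernel receive queue is backing up.
> import subprocess
>
> THRESHOLD = 64 * 1024  # bytes unread; arbitrary for illustration
>
> out = subprocess.run(["ss", "-tn"], capture_output=True, text=True).stdout
> for line in out.splitlines()[1:]:          # skip the header row
>     fields = line.split()
>     if len(fields) < 5 or fields[0] != "ESTAB":
>         continue
>     recv_q, local, peer = int(fields[1]), fields[3], fields[4]
>     if local.endswith(":9042") and recv_q > THRESHOLD:
>         print("stalled client %s, %d bytes unread" % (peer, recv_q))
> {noformat}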
> perf top during this state:
> {noformat}
> 19.73% libjvm.so [.] InstanceKlass::oop_follow_contents
> 14.05% libjvm.so [.] ObjArrayKlass::oop_follow_contents
> 7.83% libjvm.so [.] MarkSweep::follow_stack
> 6.38% libjvm.so [.] HeapRegion::block_size
> 5.41% libjvm.so [.] OtherRegionsTable::add_reference
> 5.12% libjvm.so [.] UpdateRSOopClosure::do_oop
> 4.77% libjvm.so [.] HeapRegion::block_is_obj
> {noformat}
> Frequency: Affects nodes in our 2 busier DCs (the other 2 DCs serve less
> volume), roughly 2-10 times per day. Not always the same nodes (though lately
> 2 hiccup more frequently than others).
> Tried:
> * Upgrading cassandra from 3.9 to 3.10
> * Increasing heap size from 8GB to 14GB
> * Upgrading JVM 1.8 from u112 to u121
> * Switching from CMS to G1GC
> The last change, switching from CMS to G1GC, reduced the frequency of the
> problem but has not eliminated it.
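> For reference, the heap and collector changes above boil down to JVM flags
> along these lines (illustrative excerpt only):
> {noformat}
> -Xms14G
> -Xmx14G
> -XX:+UseG1GC
> {noformat}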
> Problem 2: Read volume in DC depressed
> With our RF set to 2 per DC, if 2 or more nodes in a DC enter the GC loop
> described in problem 1, all nodes in the same DC accumulate very high
> active/pending/blocked Task-Transport queues.
> Total read volume for the DC drops to 10%-30% of normal.
> Tried:
> * Increasing -Dcassandra.max_queued_native_transport_requests to 4096
> * Reducing read_request_timeout_in_ms to 400
> * Reducing write_request_timeout_in_ms to 800
> These improved things slightly, but not significantly.
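> For reference, those changes correspond to settings along these lines
> (illustrative excerpt only):
> {noformat}
> # cassandra.yaml
> read_request_timeout_in_ms: 400
> write_request_timeout_in_ms: 800
>
> # JVM system property
> -Dcassandra.max_queued_native_transport_requests=4096
> {noformat}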
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)