Mina Naguib created CASSANDRA-13365:
---------------------------------------
             Summary: Nodes entering GC loop, does not recover
                 Key: CASSANDRA-13365
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13365
             Project: Cassandra
          Issue Type: Bug
         Environment: 34-node cluster over 4 DCs
                      Linux CentOS 7.2 x86
                      Mix of 64GB/128GB RAM / node
                      Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz
                      High read volume, low write volume, occasional sstable bulk loading
            Reporter: Mina Naguib


Over the last week we've been observing two related problems affecting our Cassandra cluster.

Problem 1: One to a few nodes per DC enter a GC loop and do not recover

Checking the heap usage stats, there's a sudden jump of 1-3GB. Some nodes recover, but some don't and log this:

{noformat}
2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure) 13G->11G(14G), 29.4127307 secs]
2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure) 13G->12G(14G), 28.1561881 secs]
2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure) 13G->13G(14G), 27.7019501 secs]
2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure) 13G->13G(14G), 27.1372267 secs]
2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure) 13G->13G(14G), 27.0703975 secs]
2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure) 13G->13G(14G), 27.3187768 secs]
2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure) 13G->13G(14G), 26.9493405 secs]
2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure) 13G->13G(14G), 27.5222085 secs]
2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure) 13G->13G(14G), 27.1769581 secs]
2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure) 13G->13G(14G), 27.4639031 secs]
2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure) 13G->13G(14G), 30.3480523 secs]
2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure) 13G->13G(14G), 27.8220513 secs]
2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure) 13G->13G(14G), 27.4691271 secs]
2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure) 13G->13G(14G), 27.0275733 secs]
2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure) 13G->13G(14G), 27.1902627 secs]
2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure) 13G->13G(14G), 27.8872850 secs]
2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure) 13G->13G(14G), 27.1609706 secs]
2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure) 13G->13G(14G), 27.3635138 secs]
2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure) 13G->13G(14G), 27.4143510 secs]
2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure) 13G->13G(14G), 27.8595986 secs]
2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure) 13G->13G(14G), 27.5242543 secs]
2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure) 13G->13G(14G), 30.8930130 secs]
2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure) 13G->13G(14G), 27.6523013 secs]
2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure) 13G->13G(14G), 27.3030198 secs]
2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure) 13G->13G(14G), 27.1594154 secs]
2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure) 13G->13G(14G), 27.3766949 secs]
2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure) 13G->13G(14G), 27.5756770 secs]
2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure) 13G->13G(14G), 27.5541813 secs]
2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure) 13G->13G(14G), 33.7510103 secs]
2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure) 13G->13G(14G), 27.4856611 secs]
2017-03-21T11:37:44.454-0400: 54981.017: [Full GC (Allocation Failure) 13G->13G(14G), 28.1269335 secs]
2017-03-21T11:38:12.774-0400: 55009.337: [Full GC (Allocation Failure) 13G->13G(14G), 27.7830448 secs]
2017-03-21T11:38:40.840-0400: 55037.402: [Full GC (Allocation Failure) 13G->13G(14G), 27.3527326 secs]
2017-03-21T11:39:08.610-0400: 55065.173: [Full GC (Allocation Failure) 13G->13G(14G), 27.5828941 secs]
2017-03-21T11:39:36.833-0400: 55093.396: [Full GC (Allocation Failure) 13G->13G(14G), 27.9303030 secs]
2017-03-21T11:40:05.265-0400: 55121.828: [Full GC (Allocation Failure) 13G->13G(14G), 36.9902867 secs]
2017-03-21T11:40:42.400-0400: 55158.963: [Full GC (Allocation Failure) 13G->13G(14G), 27.6835744 secs]
2017-03-21T11:41:10.529-0400: 55187.091: [Full GC (Allocation Failure) 13G->13G(14G), 27.1899555 secs]
2017-03-21T11:41:38.018-0400: 55214.581: [Full GC (Allocation Failure) 13G->13G(14G), 27.7309706 secs]
2017-03-21T11:42:06.062-0400: 55242.624: [Full GC (Allocation Failure) 13G->13G(14G), 29.2103234 secs]
2017-03-21T11:42:35.579-0400: 55272.141: [Full GC (Allocation Failure) 13G->13G(14G), 27.9902741 secs]
2017-03-21T11:43:03.955-0400: 55300.518: [Full GC (Allocation Failure) 13G->13G(14G), 28.0764545 secs]
2017-03-21T11:43:32.186-0400: 55328.748: [Full GC (Allocation Failure) 13G->13G(14G), 28.3913296 secs]
2017-03-21T11:44:00.810-0400: 55357.373: [Full GC (Allocation Failure) 13G->13G(14G), 27.0945131 secs]
2017-03-21T11:44:28.343-0400: 55384.905: [Full GC (Allocation Failure) 13G->13G(14G), 27.3357763 secs]
2017-03-21T11:44:55.909-0400: 55412.471: [Full GC (Allocation Failure) 13G->13G(14G), 27.9216626 secs]
2017-03-21T11:45:24.070-0400: 55440.632: [Full GC (Allocation Failure) 13G->13G(14G), 29.8189953 secs]
2017-03-21T11:45:54.092-0400: 55470.654: [Full GC (Allocation Failure) 13G->13G(14G), 27.9992085 secs]
2017-03-21T11:46:22.219-0400: 55498.782: [Full GC (Allocation Failure) 13G->13G(14G), 27.4622100 secs]
2017-03-21T11:46:49.983-0400: 55526.545: [Full GC (Allocation Failure) 13G->13G(14G), 27.9159143 secs]
2017-03-21T11:47:18.029-0400: 55554.592: [Full GC (Allocation Failure) 13G->13G(14G), 28.5207919 secs]
2017-03-21T11:47:46.938-0400: 55583.500: [Full GC (Allocation Failure) 13G->13G(14G), 27.0355291 secs]
2017-03-21T11:48:14.235-0400: 55610.798: [Full GC (Allocation Failure) 13G->13G(14G), 35.6284865 secs]
2017-03-21T11:48:50.013-0400: 55646.575: [Full GC (Allocation Failure) 13G->13G(14G), 27.3299061 secs]
2017-03-21T11:49:17.592-0400: 55674.154: [Full GC (Allocation Failure) 13G->13G(14G), 27.2544348 secs]
{noformat}

When a node enters the above state, it becomes useless over 1-5 minutes:
* Client connections die
* New connections time out
* CPU usage drops - all threads do nothing except for one thread near 100% CPU ("VM Thread" os_prio=0 tid=0x00007ffaf8209bf0 nid=0x2d53 runnable), doing the above GC in a loop
* JMX becomes unqueryable
* A regular kill signal is ignored (kill -9 is needed)

I've managed to extract a full heap dump of a node while in this state, but it's 20GB and I have not been able to find a tool that can analyze it.

I've extracted a histogram of heap objects.
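The histograms below are in the layout printed by the JDK's jmap tool. For anyone wanting to capture the same thing, a minimal sketch follows; the PID and dump path are placeholders, not necessarily the exact commands used on this cluster:

{noformat}
# Class histogram (instance counts and shallow sizes), same layout as the tables below
jmap -histo <cassandra_pid>

# Full binary heap dump in hprof format, for offline analysis
jmap -dump:format=b,file=/tmp/cassandra-heap.hprof <cassandra_pid>
{noformat}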
From a healthy node:
{noformat}
num #instances #bytes class name
----------------------------------------------
1: 12688007 1116544616 io.netty.buffer.PooledUnsafeDirectByteBuf
2: 24938940 798046080 io.netty.util.Recycler$DefaultHandle
3: 11457548 733283072 io.netty.channel.ChannelOutboundBuffer$Entry
4: 9582138 459942624 java.nio.HeapByteBuffer
5: 16974 393432416 [I
6: 2014719 311356992 [B
7: 4982167 181324848 [Ljava.lang.Object;
8: 8531 118782704 [Lio.netty.util.Recycler$DefaultHandle;
9: 2827662 113106480 org.apache.cassandra.db.rows.BufferCell
10: 2737816 87610112 org.apache.cassandra.db.rows.BTreeRow
11: 2740110 70228240 [Ljava.nio.ByteBuffer;
12: 2738081 65713944 org.apache.cassandra.db.BufferClustering
13: 1763956 56446592 com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
14: 1763736 56439552 com.googlecode.concurrentlinkedhashmap.ConcurrentHashMapV8$Node
15: 2205825 52939800 com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
16: 1617026 51744832 java.util.concurrent.ConcurrentHashMap$Node
17: 1769934 42478416 org.apache.cassandra.cache.KeyCacheKey
{noformat}

From an unhealthy node:
{noformat}
num #instances #bytes class name
----------------------------------------------
1: 103726026 4978849248 java.nio.HeapByteBuffer
2: 95646048 2832233872 [B
3: 28076465 1207916952 [Ljava.lang.Object;
4: 12803448 1126703424 io.netty.buffer.PooledUnsafeDirectByteBuf
5: 25054634 801748288 io.netty.util.Recycler$DefaultHandle
6: 11457548 733283072 io.netty.channel.ChannelOutboundBuffer$Entry
7: 24725380 593409120 java.util.ArrayList
8: 5808290 185865280 java.util.RegularEnumSet
9: 2870889 160769784 org.apache.cassandra.transport.messages.ResultMessage$Rows
10: 2937336 140992128 io.netty.buffer.SlicedAbstractByteBuf
11: 8854 118773984 [Lio.netty.util.Recycler$DefaultHandle;
12: 2830805 113232200 org.apache.cassandra.db.rows.BufferCell
13: 2937336 93994752 org.apache.cassandra.transport.Frame$Header
14: 2870928 91869696 org.apache.cassandra.cql3.ResultSet$ResultMetadata
15: 2728627 87316064 org.apache.cassandra.db.rows.BTreeRow
16: 2947033 70728792 java.util.concurrent.ConcurrentLinkedQueue$Node
17: 2937336 70496064 org.apache.cassandra.transport.Frame
{noformat}

The increase is clear in these top 3 object types:
{noformat}
1: 103726026 4978849248 java.nio.HeapByteBuffer
2: 95646048 2832233872 [B
3: 28076465 1207916952 [Ljava.lang.Object;
{noformat}

Other related observations:

Client socket states: Essentially the JVM has stopped working, and the recv-queue is piling up kernel-side but not being consumed app-side.
Worth noting that the peers (clients) have given up and closed the connections on their end, but they still appear ESTABLISHED on Cassandra's side since it hasn't drained the read buffer to EOF/error or completed the closure:

{noformat}
ESTAB 558612 0 172.30.72.133:9042 172.30.72.145:11418
ESTAB 564172 0 172.30.72.133:9042 172.30.72.145:12420
ESTAB 545284 0 172.30.72.133:9042 172.30.72.145:18212
ESTAB 565468 0 172.30.72.133:9042 172.30.72.145:18346
ESTAB 525884 0 172.30.72.133:9042 172.30.72.145:18352
ESTAB 532324 0 172.30.72.133:9042 172.30.72.145:28896
ESTAB 517592 0 172.30.72.133:9042 172.30.72.145:34504
ESTAB 553392 0 172.30.72.133:9042 172.30.72.145:36926
ESTAB 581952 0 172.30.72.133:9042 172.30.72.145:40684
ESTAB 543668 0 172.30.72.133:9042 172.30.72.145:42282
ESTAB 511860 0 172.30.72.133:9042 172.30.72.145:47350
ESTAB 590380 0 172.30.72.133:9042 172.30.72.145:47566
ESTAB 583120 0 172.30.72.133:9042 172.30.72.145:50386
ESTAB 536772 0 172.30.72.133:9042 172.30.72.145:52656
ESTAB 31 0 172.30.72.133:9042 172.30.72.145:52946
ESTAB 532140 0 172.30.72.133:9042 172.30.72.145:53618
ESTAB 14397 0 172.30.72.133:9042 172.30.72.150:10016
ESTAB 6444 0 172.30.72.133:9042 172.30.72.150:12176
ESTAB 7568 0 172.30.72.133:9042 172.30.72.150:12934
ESTAB 139479 0 172.30.72.133:9042 172.30.72.150:13186
ESTAB 6896 0 172.30.72.133:9042 172.30.72.150:17316
ESTAB 7808 0 172.30.72.133:9042 172.30.72.150:17448
ESTAB 8712 0 172.30.72.133:9042 172.30.72.150:17748
ESTAB 6944 0 172.30.72.133:9042 172.30.72.150:18472
ESTAB 136700 0 172.30.72.133:9042 172.30.72.150:19860
ESTAB 16260 0 172.30.72.133:9042 172.30.72.150:20696
ESTAB 8943 0 172.30.72.133:9042 172.30.72.150:21208
ESTAB 132877 0 172.30.72.133:9042 172.30.72.150:21420
ESTAB 6936 0 172.30.72.133:9042 172.30.72.150:25440
ESTAB 92 0 172.30.72.133:9042 172.30.72.150:25778
ESTAB 137312 0 172.30.72.133:9042 172.30.72.150:28258
ESTAB 40 0 172.30.72.133:9042 172.30.72.150:28348
ESTAB 6940 0 172.30.72.133:9042 172.30.72.150:29718
ESTAB 6416 0 172.30.72.133:9042 172.30.72.150:29858
ESTAB 6580 0 172.30.72.133:9042 172.30.72.150:30518
ESTAB 4202 0 172.30.72.133:9042 172.30.72.150:30948
ESTAB 6060 0 172.30.72.133:9042 172.30.72.150:35276
ESTAB 0 0 172.30.72.133:9042 172.30.72.150:40114
ESTAB 134303 0 172.30.72.133:9042 172.30.72.150:42212
ESTAB 396 0 172.30.72.133:9042 172.30.72.150:44848
ESTAB 136450 0 172.30.72.133:9042 172.30.72.150:47742
ESTAB 8318 0 172.30.72.133:9042 172.30.72.150:51054
ESTAB 11976 0 172.30.72.133:9042 172.30.72.150:54560
ESTAB 136193 0 172.30.72.133:9042 172.30.72.150:56094
ESTAB 138737 0 172.30.72.133:9042 172.30.72.150:60382
ESTAB 7996 0 172.30.72.133:9042 172.30.72.150:60932
ESTAB 137235 0 172.30.72.133:9042 172.30.72.150:61568
ESTAB 8184 0 172.30.72.133:9042 172.30.72.150:63054
ESTAB 985276 0 172.30.72.133:9042 172.30.72.175:23494
ESTAB 601528 0 172.30.72.133:9042 172.30.72.175:24140
ESTAB 557844 0 172.30.72.133:9042 172.30.72.175:42248
ESTAB 469344 0 172.30.72.133:9042 172.30.72.175:59704
ESTAB 507160 0 172.30.72.133:9042 172.30.72.185:11958
ESTAB 449196 0 172.30.72.133:9042 172.30.72.185:15458
ESTAB 487796 0 172.30.72.133:9042 172.30.72.185:16458
ESTAB 441132 0 172.30.72.133:9042 172.30.72.185:23194
ESTAB 486664 0 172.30.72.133:9042 172.30.72.185:25276
ESTAB 554780 0 172.30.72.133:9042 172.30.72.185:29102
ESTAB 496248 0 172.30.72.133:9042 172.30.72.185:29428
ESTAB 420048 0 172.30.72.133:9042 172.30.72.185:39650
ESTAB 498704 0 172.30.72.133:9042 172.30.72.185:46424
ESTAB 506488 0 172.30.72.133:9042 172.30.72.185:49722
ESTAB 509416 0 172.30.72.133:9042 172.30.72.185:55838
ESTAB 511956 0 172.30.72.133:9042 172.30.72.185:57670
ESTAB 496720 0 172.30.72.133:9042 172.30.72.185:59884
{noformat}

perf top during this state:
{noformat}
19.73% libjvm.so [.] InstanceKlass::oop_follow_contents
14.05% libjvm.so [.] ObjArrayKlass::oop_follow_contents
 7.83% libjvm.so [.] MarkSweep::follow_stack
 6.38% libjvm.so [.] HeapRegion::block_size
 5.41% libjvm.so [.] OtherRegionsTable::add_reference
 5.12% libjvm.so [.] UpdateRSOopClosure::do_oop
 4.77% libjvm.so [.] HeapRegion::block_is_obj
{noformat}

Frequency: Affecting nodes in our 2 busier DCs (the other 2 DCs serve less volume), roughly 2-10 times per day. Not always the same nodes (though lately 2 hiccup more frequently than others).

Tried:
* Upgrading cassandra from 3.9 to 3.10
* Increasing heap size from 8GB to 14GB
* Upgrading JVM 1.8 from u112 to u121
* Switching from CMS to G1GC

The last change, switching from CMS to G1GC, reduced the frequency of the problem but has not eliminated it.

Problem 2: Read volume in DC depressed

With our RF set to 2 per DC, if 2 or more nodes in a DC enter the GC loop described in problem 1, all nodes in the same DC accumulate very high active/pending/blocked Task-Transport queues. Total read volume for the DC drops to 10%-30% of normal.

Tried:
* Increasing -Dcassandra.max_queued_native_transport_requests to 4096
* Reducing read_request_timeout_in_ms to 400
* Reducing write_request_timeout_in_ms to 800

These changes improved things slightly, but not by much.
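For reference, a sketch of where the tunables listed in the two "Tried" sections live, assuming a stock 3.10 conf/ layout; the values are the ones quoted above, and the exact file paths may differ per install:

{noformat}
# conf/jvm.options - heap size and collector (the CMS -> G1GC switch)
-Xms14G
-Xmx14G
-XX:+UseG1GC

# conf/cassandra-env.sh - deeper native-transport request queue
JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=4096"

# conf/cassandra.yaml - tightened client request timeouts
read_request_timeout_in_ms: 400
write_request_timeout_in_ms: 800
{noformat}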