[ https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704403#comment-16704403 ]

Tuo Zhu commented on CASSANDRA-13365:
-------------------------------------

We have the exact same issue on our 3-node 3.11.3 cluster. Our data model is 
somewhat like a graph database.

There are several types of nodes and one big table that stores the edges from 
one node to another. The edge table uses the node id as its partition key. Some 
nodes have a huge number of edges to other nodes (tens of millions), so we 
added a shard_key to the partition key to keep partitions smaller. From the 
log, large partitions are now 100MB ~ 600MB, but in our experience the issue 
isn't related to "large partition write".

When we run a heavy loading task on the cluster, some of the nodes get stuck 
in a Full GC cycle just as the OP described. It happens every 10 - 30 hours and 
doesn't recover even after the load has been stopped for hours.

Each node has 16C/32T and 64GB of memory. I'm using G1 with a 24GB Xmx. I'll 
try a heap dump analysis later.
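
For the capture I'll probably use the standard JDK tooling, along these lines 
(PID and paths are placeholders):

{noformat}
# Class histogram of live objects (note: :live forces a full GC by itself)
jmap -histo:live <cassandra_pid> > /tmp/cassandra-histo.txt

# Full HPROF heap dump for offline analysis
jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof <cassandra_pid>
{noformat}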

> Nodes entering GC loop, does not recover
> ----------------------------------------
>
>                 Key: CASSANDRA-13365
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13365
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 34-node cluster over 4 DCs
> Linux CentOS 7.2 x86
> Mix of 64GB/128GB RAM / node
> Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz
> High read volume, low write volume, occasional sstable bulk loading
>            Reporter: Mina Naguib
>            Priority: Major
>
> Over the last week we've been observing two related problems affecting our 
> Cassandra cluster.
> Problem 1: one to a few nodes per DC entering a GC loop and not recovering
> Checking the heap usage stats, there's a sudden jump of 1-3GB. Some nodes 
> recover, but some don't and log this:
> {noformat}
> 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure)  
> 13G->11G(14G), 29.4127307 secs]
> 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure)  
> 13G->12G(14G), 28.1561881 secs]
> 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.7019501 secs]
> 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.1372267 secs]
> 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.0703975 secs]
> 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.3187768 secs]
> 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure)  
> 13G->13G(14G), 26.9493405 secs]
> 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.5222085 secs]
> 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.1769581 secs]
> 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.4639031 secs]
> 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure)  
> 13G->13G(14G), 30.3480523 secs]
> 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.8220513 secs]
> 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.4691271 secs]
> 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.0275733 secs]
> 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.1902627 secs]
> 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.8872850 secs]
> 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.1609706 secs]
> 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.3635138 secs]
> 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.4143510 secs]
> 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.8595986 secs]
> 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.5242543 secs]
> 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure)  
> 13G->13G(14G), 30.8930130 secs]
> 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.6523013 secs]
> 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.3030198 secs]
> 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.1594154 secs]
> 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.3766949 secs]
> 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.5756770 secs]
> 2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.5541813 secs]
> 2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure)  
> 13G->13G(14G), 33.7510103 secs]
> 2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.4856611 secs]
> 2017-03-21T11:37:44.454-0400: 54981.017: [Full GC (Allocation Failure)  
> 13G->13G(14G), 28.1269335 secs]
> 2017-03-21T11:38:12.774-0400: 55009.337: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.7830448 secs]
> 2017-03-21T11:38:40.840-0400: 55037.402: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.3527326 secs]
> 2017-03-21T11:39:08.610-0400: 55065.173: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.5828941 secs]
> 2017-03-21T11:39:36.833-0400: 55093.396: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.9303030 secs]
> 2017-03-21T11:40:05.265-0400: 55121.828: [Full GC (Allocation Failure)  
> 13G->13G(14G), 36.9902867 secs]
> 2017-03-21T11:40:42.400-0400: 55158.963: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.6835744 secs]
> 2017-03-21T11:41:10.529-0400: 55187.091: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.1899555 secs]
> 2017-03-21T11:41:38.018-0400: 55214.581: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.7309706 secs]
> 2017-03-21T11:42:06.062-0400: 55242.624: [Full GC (Allocation Failure)  
> 13G->13G(14G), 29.2103234 secs]
> 2017-03-21T11:42:35.579-0400: 55272.141: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.9902741 secs]
> 2017-03-21T11:43:03.955-0400: 55300.518: [Full GC (Allocation Failure)  
> 13G->13G(14G), 28.0764545 secs]
> 2017-03-21T11:43:32.186-0400: 55328.748: [Full GC (Allocation Failure)  
> 13G->13G(14G), 28.3913296 secs]
> 2017-03-21T11:44:00.810-0400: 55357.373: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.0945131 secs]
> 2017-03-21T11:44:28.343-0400: 55384.905: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.3357763 secs]
> 2017-03-21T11:44:55.909-0400: 55412.471: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.9216626 secs]
> 2017-03-21T11:45:24.070-0400: 55440.632: [Full GC (Allocation Failure)  
> 13G->13G(14G), 29.8189953 secs]
> 2017-03-21T11:45:54.092-0400: 55470.654: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.9992085 secs]
> 2017-03-21T11:46:22.219-0400: 55498.782: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.4622100 secs]
> 2017-03-21T11:46:49.983-0400: 55526.545: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.9159143 secs]
> 2017-03-21T11:47:18.029-0400: 55554.592: [Full GC (Allocation Failure)  
> 13G->13G(14G), 28.5207919 secs]
> 2017-03-21T11:47:46.938-0400: 55583.500: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.0355291 secs]
> 2017-03-21T11:48:14.235-0400: 55610.798: [Full GC (Allocation Failure)  
> 13G->13G(14G), 35.6284865 secs]
> 2017-03-21T11:48:50.013-0400: 55646.575: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.3299061 secs]
> 2017-03-21T11:49:17.592-0400: 55674.154: [Full GC (Allocation Failure)  
> 13G->13G(14G), 27.2544348 secs]
> {noformat}
> When a node enters the above state, it becomes useless within 1-5 minutes:
> * Client connections die
> * New connections time out
> * CPU usage drops - all threads do nothing except for 1 thread near 100% CPU 
> ("VM Thread" os_prio=0 tid=0x00007ffaf8209bf0 nid=0x2d53 runnable"), doing 
> the above GC in a loop
> * JMX becomes unqueryable
> * Regular kill signal is ignored (needs kill -9)
> I've managed to extract a full heap dump of a node while in this state, but 
> it's 20GB and I have not been able to find a tool that can analyze it.
> I've extracted a histogram of heap objects.  From a healthy node:
> {noformat}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:      12688007     1116544616  io.netty.buffer.PooledUnsafeDirectByteBuf
>    2:      24938940      798046080  io.netty.util.Recycler$DefaultHandle
>    3:      11457548      733283072  
> io.netty.channel.ChannelOutboundBuffer$Entry
>    4:       9582138      459942624  java.nio.HeapByteBuffer
>    5:         16974      393432416  [I
>    6:       2014719      311356992  [B
>    7:       4982167      181324848  [Ljava.lang.Object;
>    8:          8531      118782704  [Lio.netty.util.Recycler$DefaultHandle;
>    9:       2827662      113106480  org.apache.cassandra.db.rows.BufferCell
>   10:       2737816       87610112  org.apache.cassandra.db.rows.BTreeRow
>   11:       2740110       70228240  [Ljava.nio.ByteBuffer;
>   12:       2738081       65713944  org.apache.cassandra.db.BufferClustering
>   13:       1763956       56446592  
> com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
>   14:       1763736       56439552  
> com.googlecode.concurrentlinkedhashmap.ConcurrentHashMapV8$Node
>   15:       2205825       52939800  
> com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
>   16:       1617026       51744832  
> java.util.concurrent.ConcurrentHashMap$Node
>   17:       1769934       42478416  org.apache.cassandra.cache.KeyCacheKey
> {noformat}
> From an unhealthy node:
> {noformat}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:     103726026     4978849248  java.nio.HeapByteBuffer
>    2:      95646048     2832233872  [B
>    3:      28076465     1207916952  [Ljava.lang.Object;
>    4:      12803448     1126703424  io.netty.buffer.PooledUnsafeDirectByteBuf
>    5:      25054634      801748288  io.netty.util.Recycler$DefaultHandle
>    6:      11457548      733283072  
> io.netty.channel.ChannelOutboundBuffer$Entry
>    7:      24725380      593409120  java.util.ArrayList
>    8:       5808290      185865280  java.util.RegularEnumSet
>    9:       2870889      160769784  
> org.apache.cassandra.transport.messages.ResultMessage$Rows
>   10:       2937336      140992128  io.netty.buffer.SlicedAbstractByteBuf
>   11:          8854      118773984  [Lio.netty.util.Recycler$DefaultHandle;
>   12:       2830805      113232200  org.apache.cassandra.db.rows.BufferCell
>   13:       2937336       93994752  
> org.apache.cassandra.transport.Frame$Header
>   14:       2870928       91869696  
> org.apache.cassandra.cql3.ResultSet$ResultMetadata
>   15:       2728627       87316064  org.apache.cassandra.db.rows.BTreeRow
>   16:       2947033       70728792  
> java.util.concurrent.ConcurrentLinkedQueue$Node
>   17:       2937336       70496064  org.apache.cassandra.transport.Frame
> {noformat}
> The increase is clear in these top 3 object types:
> {noformat}
>    1:     103726026     4978849248  java.nio.HeapByteBuffer
>    2:      95646048     2832233872  [B
>    3:      28076465     1207916952  [Ljava.lang.Object;
> {noformat}
> Other related observations:
> Client socket states: Essentially the JVM has stopped working, and the 
> recv-queue is piling up kernel-side but not being consumed app-side.  Worth 
> noting that the peers (clients) have given up and closed the connections on 
> their end, but they still appear ESTABlished on cassandra's side since it 
> hasn't drained the read buffers to EOF/error or completed the closure:
> {noformat}
> ESTAB      558612 0      172.30.72.133:9042               172.30.72.145:11418
> ESTAB      564172 0      172.30.72.133:9042               172.30.72.145:12420
> ESTAB      545284 0      172.30.72.133:9042               172.30.72.145:18212
> ESTAB      565468 0      172.30.72.133:9042               172.30.72.145:18346
> ESTAB      525884 0      172.30.72.133:9042               172.30.72.145:18352
> ESTAB      532324 0      172.30.72.133:9042               172.30.72.145:28896
> ESTAB      517592 0      172.30.72.133:9042               172.30.72.145:34504
> ESTAB      553392 0      172.30.72.133:9042               172.30.72.145:36926
> ESTAB      581952 0      172.30.72.133:9042               172.30.72.145:40684
> ESTAB      543668 0      172.30.72.133:9042               172.30.72.145:42282
> ESTAB      511860 0      172.30.72.133:9042               172.30.72.145:47350
> ESTAB      590380 0      172.30.72.133:9042               172.30.72.145:47566
> ESTAB      583120 0      172.30.72.133:9042               172.30.72.145:50386
> ESTAB      536772 0      172.30.72.133:9042               172.30.72.145:52656
> ESTAB      31     0      172.30.72.133:9042               172.30.72.145:52946
> ESTAB      532140 0      172.30.72.133:9042               172.30.72.145:53618
> ESTAB      14397  0      172.30.72.133:9042               172.30.72.150:10016
> ESTAB      6444   0      172.30.72.133:9042               172.30.72.150:12176
> ESTAB      7568   0      172.30.72.133:9042               172.30.72.150:12934
> ESTAB      139479 0      172.30.72.133:9042               172.30.72.150:13186
> ESTAB      6896   0      172.30.72.133:9042               172.30.72.150:17316
> ESTAB      7808   0      172.30.72.133:9042               172.30.72.150:17448
> ESTAB      8712   0      172.30.72.133:9042               172.30.72.150:17748
> ESTAB      6944   0      172.30.72.133:9042               172.30.72.150:18472
> ESTAB      136700 0      172.30.72.133:9042               172.30.72.150:19860
> ESTAB      16260  0      172.30.72.133:9042               172.30.72.150:20696
> ESTAB      8943   0      172.30.72.133:9042               172.30.72.150:21208
> ESTAB      132877 0      172.30.72.133:9042               172.30.72.150:21420
> ESTAB      6936   0      172.30.72.133:9042               172.30.72.150:25440
> ESTAB      92     0      172.30.72.133:9042               172.30.72.150:25778
> ESTAB      137312 0      172.30.72.133:9042               172.30.72.150:28258
> ESTAB      40     0      172.30.72.133:9042               172.30.72.150:28348
> ESTAB      6940   0      172.30.72.133:9042               172.30.72.150:29718
> ESTAB      6416   0      172.30.72.133:9042               172.30.72.150:29858
> ESTAB      6580   0      172.30.72.133:9042               172.30.72.150:30518
> ESTAB      4202   0      172.30.72.133:9042               172.30.72.150:30948
> ESTAB      6060   0      172.30.72.133:9042               172.30.72.150:35276
> ESTAB      0      0      172.30.72.133:9042               172.30.72.150:40114
> ESTAB      134303 0      172.30.72.133:9042               172.30.72.150:42212
> ESTAB      396    0      172.30.72.133:9042               172.30.72.150:44848
> ESTAB      136450 0      172.30.72.133:9042               172.30.72.150:47742
> ESTAB      8318   0      172.30.72.133:9042               172.30.72.150:51054
> ESTAB      11976  0      172.30.72.133:9042               172.30.72.150:54560
> ESTAB      136193 0      172.30.72.133:9042               172.30.72.150:56094
> ESTAB      138737 0      172.30.72.133:9042               172.30.72.150:60382
> ESTAB      7996   0      172.30.72.133:9042               172.30.72.150:60932
> ESTAB      137235 0      172.30.72.133:9042               172.30.72.150:61568
> ESTAB      8184   0      172.30.72.133:9042               172.30.72.150:63054
> ESTAB      985276 0      172.30.72.133:9042               172.30.72.175:23494
> ESTAB      601528 0      172.30.72.133:9042               172.30.72.175:24140
> ESTAB      557844 0      172.30.72.133:9042               172.30.72.175:42248
> ESTAB      469344 0      172.30.72.133:9042               172.30.72.175:59704
> ESTAB      507160 0      172.30.72.133:9042               172.30.72.185:11958
> ESTAB      449196 0      172.30.72.133:9042               172.30.72.185:15458
> ESTAB      487796 0      172.30.72.133:9042               172.30.72.185:16458
> ESTAB      441132 0      172.30.72.133:9042               172.30.72.185:23194
> ESTAB      486664 0      172.30.72.133:9042               172.30.72.185:25276
> ESTAB      554780 0      172.30.72.133:9042               172.30.72.185:29102
> ESTAB      496248 0      172.30.72.133:9042               172.30.72.185:29428
> ESTAB      420048 0      172.30.72.133:9042               172.30.72.185:39650
> ESTAB      498704 0      172.30.72.133:9042               172.30.72.185:46424
> ESTAB      506488 0      172.30.72.133:9042               172.30.72.185:49722
> ESTAB      509416 0      172.30.72.133:9042               172.30.72.185:55838
> ESTAB      511956 0      172.30.72.133:9042               172.30.72.185:57670
> ESTAB      496720 0      172.30.72.133:9042               172.30.72.185:59884
> {noformat}
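> For reference, the same socket-state view can be reproduced on an affected 
> node with something along the lines of (9042 being the native transport port):
> {noformat}
> ss -nt '( sport = :9042 )'
> {noformat}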
> perf top during this state:
> {noformat}
>   19.73%  libjvm.so                                              [.] 
> InstanceKlass::oop_follow_contents
>   14.05%  libjvm.so                                              [.] 
> ObjArrayKlass::oop_follow_contents
>    7.83%  libjvm.so                                              [.] 
> MarkSweep::follow_stack
>    6.38%  libjvm.so                                              [.] 
> HeapRegion::block_size
>    5.41%  libjvm.so                                              [.] 
> OtherRegionsTable::add_reference
>    5.12%  libjvm.so                                              [.] 
> UpdateRSOopClosure::do_oop
>    4.77%  libjvm.so                                              [.] 
> HeapRegion::block_is_obj
> {noformat}
> Frequency: Affecting nodes in our 2 busier DCs (the other 2 DCs serve less 
> volume), roughly 2-10 times per day.  Not always the same nodes (though 
> lately 2 hiccup more frequently than the others).
> Tried:
> * Upgrading cassandra from 3.9 to 3.10
> * Increasing heap size from 8GB to 14GB
> * Upgrading JVM 1.8 from u112 to u121
> * Switching from CMS to G1GC
> The last change, switching from CMS to G1GC, reduced the frequency of the 
> problem but has not eliminated it.
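> For reference, the CMS-to-G1 switch plus the larger heap amount to roughly 
> the following conf/jvm.options entries (only the 14GB heap and G1 itself 
> reflect the actual settings; the remaining values are illustrative defaults):
> {noformat}
> # conf/jvm.options - CMS section commented out, G1 section enabled
> -Xms14G
> -Xmx14G
> -XX:+UseG1GC
> -XX:G1RSetUpdatingPauseTimePercent=5
> -XX:MaxGCPauseMillis=500
> {noformat}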
> Problem 2: Read volume in DC depressed
> With our RF set to 2 per DC, if 2 or more nodes in a DC enter the GC loop 
> described in problem 1, all nodes in the same DC accumulate very high 
> active/pending/blocked Task-Transport queues.
> Total read volume for the DC drops to 10%-30% of normal.
> Tried:
> * Increasing -Dcassandra.max_queued_native_transport_requests to 4096
> * Reducing read_request_timeout_in_ms to 400
> * Reducing write_request_timeout_in_ms to 800
> This improved things slightly, but not by much.
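> For clarity, those three changes map to roughly the following (timeouts in 
> cassandra.yaml, the queue cap as a JVM system property, e.g. via 
> cassandra-env.sh or jvm.options):
> {noformat}
> # cassandra.yaml
> read_request_timeout_in_ms: 400
> write_request_timeout_in_ms: 800
>
> # JVM system property
> -Dcassandra.max_queued_native_transport_requests=4096
> {noformat}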


