[jira] [Commented] (CASSANDRA-13365) Nodes entering GC loop, does not recover
[ https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529442#comment-17529442 ] Mina Naguib commented on CASSANDRA-13365: - [~douglasawh] Unfortunately it's been years since I used Cassandra. I don't have any info on whether this still exists in the 4.x series or not. > Nodes entering GC loop, does not recover > > > Key: CASSANDRA-13365 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13365 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Core > Environment: 34-node cluster over 4 DCs > Linux CentOS 7.2 x86 > Mix of 64GB/128GB RAM / node > Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz > High read volume, low write volume, occasional sstable bulk loading >Reporter: Mina Naguib >Priority: Normal > > Over the last week we've been observing two related problems affecting our > Cassandra cluster > Problem 1: 1-few nodes per DC entering GC loop, not recovering > Checking the heap usage stats, there's a sudden jump of 1-3GB. Some nodes > recover, but some don't and log this: > {noformat} > 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure) > 13G->11G(14G), 29.4127307 secs] > 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure) > 13G->12G(14G), 28.1561881 secs] > 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7019501 secs] > 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1372267 secs] > 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0703975 secs] > 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3187768 secs] > 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure) > 13G->13G(14G), 26.9493405 secs] > 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5222085 secs] > 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1769581 secs] > 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4639031 secs] > 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure) > 13G->13G(14G), 30.3480523 secs] > 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8220513 secs] > 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4691271 secs] > 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0275733 secs] > 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1902627 secs] > 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8872850 secs] > 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1609706 secs] > 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3635138 secs] > 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4143510 secs] > 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8595986 secs] > 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5242543 secs] > 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure) > 13G->13G(14G), 30.8930130 secs] > 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure) > 13G->13G(14G), 27.6523013 secs] > 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3030198 secs] > 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1594154 secs] > 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3766949 secs] > 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5756770 secs] > 2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5541813 secs] > 2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure) > 13G->13G(14G), 33.7510103 secs] > 2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4856611 secs] > 2017-03-21T11:37:44.454-0400: 54981.017: [Full GC (Allocation Failure) > 13G->13G(14G), 28.1269335 secs] > 2017-03-21T11:38:12.774-0400: 55009.337: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7830448 secs] > 2017-03-21T11:38:40.840-0400: 55037.402: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3527326 secs] > 2017-03-21T11:39:08.610-0400: 55065.173: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5828941 secs] >
[jira] [Commented] (CASSANDRA-13365) Nodes entering GC loop, does not recover
[ https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528902#comment-17528902 ] Doug Whitfield commented on CASSANDRA-13365: Do we know if this is a problem in the 4.x series? Also, I don't see any reference to a version after 3.11.3 and the last comment was in 2018. Do we think maybe this got fixed by accident with some of the other improvements? > Nodes entering GC loop, does not recover > > > Key: CASSANDRA-13365 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13365 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Core > Environment: 34-node cluster over 4 DCs > Linux CentOS 7.2 x86 > Mix of 64GB/128GB RAM / node > Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz > High read volume, low write volume, occasional sstable bulk loading >Reporter: Mina Naguib >Priority: Normal > > Over the last week we've been observing two related problems affecting our > Cassandra cluster > Problem 1: 1-few nodes per DC entering GC loop, not recovering > Checking the heap usage stats, there's a sudden jump of 1-3GB. Some nodes > recover, but some don't and log this: > {noformat} > 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure) > 13G->11G(14G), 29.4127307 secs] > 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure) > 13G->12G(14G), 28.1561881 secs] > 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7019501 secs] > 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1372267 secs] > 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0703975 secs] > 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3187768 secs] > 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure) > 13G->13G(14G), 26.9493405 secs] > 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5222085 secs] > 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1769581 secs] > 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4639031 secs] > 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure) > 13G->13G(14G), 30.3480523 secs] > 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8220513 secs] > 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4691271 secs] > 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0275733 secs] > 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1902627 secs] > 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8872850 secs] > 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1609706 secs] > 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3635138 secs] > 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4143510 secs] > 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8595986 secs] > 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5242543 secs] > 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure) > 13G->13G(14G), 30.8930130 secs] > 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure) > 13G->13G(14G), 27.6523013 secs] > 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3030198 secs] > 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1594154 secs] > 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3766949 secs] > 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5756770 secs] > 2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5541813 secs] > 2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure) > 13G->13G(14G), 33.7510103 secs] > 2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4856611 secs] > 2017-03-21T11:37:44.454-0400: 54981.017: [Full GC (Allocation Failure) > 13G->13G(14G), 28.1269335 secs] > 2017-03-21T11:38:12.774-0400: 55009.337: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7830448 secs] > 2017-03-21T11:38:40.840-0400: 55037.402: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3527326 secs] > 2017-03-21T11:39:08.610-0400:
[jira] [Commented] (CASSANDRA-13365) Nodes entering GC loop, does not recover
[ https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726569#comment-16726569 ] Tuo Zhu commented on CASSANDRA-13365: - [~rushman] We don't have any materialized views. But I found the data is not balanced in our 3 nodes cluster. I'm not sure if this related to the issue. UN 172.16.213.213 1.46 TiB 256 ? c5853306-d369-46b6-94bb-cc38ff4ab0ff rack1 UN 172.16.213.215 1.05 TiB 256 ? 1ab8e142-0b1b-454f-97fa-ba4959e7e5e2 rack1 UN 172.16.213.217 1.09 TiB 256 ? 611b74e5-5d06-4a54-8bc0-39159d469ada rack1 I don't know if it's normal. We're using 256 tokens on each node. > Nodes entering GC loop, does not recover > > > Key: CASSANDRA-13365 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13365 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 34-node cluster over 4 DCs > Linux CentOS 7.2 x86 > Mix of 64GB/128GB RAM / node > Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz > High read volume, low write volume, occasional sstable bulk loading >Reporter: Mina Naguib >Priority: Major > > Over the last week we've been observing two related problems affecting our > Cassandra cluster > Problem 1: 1-few nodes per DC entering GC loop, not recovering > Checking the heap usage stats, there's a sudden jump of 1-3GB. Some nodes > recover, but some don't and log this: > {noformat} > 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure) > 13G->11G(14G), 29.4127307 secs] > 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure) > 13G->12G(14G), 28.1561881 secs] > 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7019501 secs] > 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1372267 secs] > 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0703975 secs] > 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3187768 secs] > 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure) > 13G->13G(14G), 26.9493405 secs] > 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5222085 secs] > 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1769581 secs] > 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4639031 secs] > 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure) > 13G->13G(14G), 30.3480523 secs] > 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8220513 secs] > 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4691271 secs] > 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0275733 secs] > 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1902627 secs] > 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8872850 secs] > 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1609706 secs] > 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3635138 secs] > 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4143510 secs] > 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8595986 secs] > 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5242543 secs] > 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure) > 13G->13G(14G), 30.8930130 secs] > 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure) > 13G->13G(14G), 27.6523013 secs] > 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3030198 secs] > 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1594154 secs] > 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3766949 secs] > 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5756770 secs] > 2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5541813 secs] > 2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure) > 13G->13G(14G), 33.7510103 secs] > 2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4856611 secs] > 2017-03-21T11:37:44.454-0400: 54981.017: [Full GC (Allocation Failure) > 13G->13G(14G), 28.1269335 secs] > 2017-03-21T11:38:12.774-0400: 55009.337: [Full
[jira] [Commented] (CASSANDRA-13365) Nodes entering GC loop, does not recover
[ https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704616#comment-16704616 ] Sergey Kirillov commented on CASSANDRA-13365: - [~sickcate] do you have any materialized views in your database? > Nodes entering GC loop, does not recover > > > Key: CASSANDRA-13365 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13365 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 34-node cluster over 4 DCs > Linux CentOS 7.2 x86 > Mix of 64GB/128GB RAM / node > Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz > High read volume, low write volume, occasional sstable bulk loading >Reporter: Mina Naguib >Priority: Major > > Over the last week we've been observing two related problems affecting our > Cassandra cluster > Problem 1: 1-few nodes per DC entering GC loop, not recovering > Checking the heap usage stats, there's a sudden jump of 1-3GB. Some nodes > recover, but some don't and log this: > {noformat} > 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure) > 13G->11G(14G), 29.4127307 secs] > 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure) > 13G->12G(14G), 28.1561881 secs] > 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7019501 secs] > 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1372267 secs] > 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0703975 secs] > 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3187768 secs] > 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure) > 13G->13G(14G), 26.9493405 secs] > 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5222085 secs] > 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1769581 secs] > 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4639031 secs] > 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure) > 13G->13G(14G), 30.3480523 secs] > 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8220513 secs] > 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4691271 secs] > 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0275733 secs] > 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1902627 secs] > 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8872850 secs] > 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1609706 secs] > 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3635138 secs] > 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4143510 secs] > 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8595986 secs] > 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5242543 secs] > 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure) > 13G->13G(14G), 30.8930130 secs] > 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure) > 13G->13G(14G), 27.6523013 secs] > 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3030198 secs] > 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1594154 secs] > 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3766949 secs] > 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5756770 secs] > 2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5541813 secs] > 2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure) > 13G->13G(14G), 33.7510103 secs] > 2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4856611 secs] > 2017-03-21T11:37:44.454-0400: 54981.017: [Full GC (Allocation Failure) > 13G->13G(14G), 28.1269335 secs] > 2017-03-21T11:38:12.774-0400: 55009.337: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7830448 secs] > 2017-03-21T11:38:40.840-0400: 55037.402: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3527326 secs] > 2017-03-21T11:39:08.610-0400: 55065.173: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5828941 secs] > 2017-03-21T11:39:36.833-0400: 55093.396: [Full GC (Allocation Failure) > 13G->13G(14G),
[jira] [Commented] (CASSANDRA-13365) Nodes entering GC loop, does not recover
[ https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704403#comment-16704403 ] Tuo Zhu commented on CASSANDRA-13365: - We have the exact same issue on our 3 nodes 3.11.3 cluster. Our data model is kinda like a graph database. There are several types of nodes and one big table to store edges from one node to another. Edge table uses node id as partition keys. Some node has huge amount of edges to other ones(10s of millions), so we added another shard_key into partition keys to make partition smaller. From the log, large partitions now at size 100MB ~ 600MB, but the issue isn't related to "large partition write", from our experiences. When we are running heavy loading task on the cluster, some of the nodes stuck in Full GC cycle just like OP described. It happens every 10 - 30 hours and won't recover even if the loads stopped for hours. Our node has 16C/32T and 64G memory. I'm using G1 and 24G xmx. I'll try a dump analyze later. > Nodes entering GC loop, does not recover > > > Key: CASSANDRA-13365 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13365 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 34-node cluster over 4 DCs > Linux CentOS 7.2 x86 > Mix of 64GB/128GB RAM / node > Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz > High read volume, low write volume, occasional sstable bulk loading >Reporter: Mina Naguib >Priority: Major > > Over the last week we've been observing two related problems affecting our > Cassandra cluster > Problem 1: 1-few nodes per DC entering GC loop, not recovering > Checking the heap usage stats, there's a sudden jump of 1-3GB. Some nodes > recover, but some don't and log this: > {noformat} > 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure) > 13G->11G(14G), 29.4127307 secs] > 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure) > 13G->12G(14G), 28.1561881 secs] > 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7019501 secs] > 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1372267 secs] > 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0703975 secs] > 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3187768 secs] > 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure) > 13G->13G(14G), 26.9493405 secs] > 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5222085 secs] > 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1769581 secs] > 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4639031 secs] > 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure) > 13G->13G(14G), 30.3480523 secs] > 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8220513 secs] > 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4691271 secs] > 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0275733 secs] > 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1902627 secs] > 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8872850 secs] > 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1609706 secs] > 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3635138 secs] > 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4143510 secs] > 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8595986 secs] > 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5242543 secs] > 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure) > 13G->13G(14G), 30.8930130 secs] > 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure) > 13G->13G(14G), 27.6523013 secs] > 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3030198 secs] > 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1594154 secs] > 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3766949 secs] > 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5756770 secs] > 2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure) > 13G->13G(14G),
[jira] [Commented] (CASSANDRA-13365) Nodes entering GC loop, does not recover
[ https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15976684#comment-15976684 ] Mina Naguib commented on CASSANDRA-13365: - [~jasonstack] [~jjirsa] The workload on the whole cluster falls into one of these 3 cases: 1. Full single-row select: select * from users where key=XXX High volume, > 1M/s 2. Single column insert: insert into users (key,column1,column2,column3,value) values (?,?,?,?,?) using TTL ? Low volume, 10k/s 3. Batch loading with sstableloader Few times a day We do not do any "analytic"-style workload on the cassandra cluster (no selects multiple keys, high limit, etc..). Also I believe (what's the best way to check ?) that we don't have "fat rows". Most rows should average 1k-1.5k Re concurrent read threads, if you mean the `concurrent_reads` setting, it's a reasonable 32 Re sjk-plus/jvm-tools, I wasn't aware of it. I'll give it a shot, but if it relies on JMX being cooperative I don't know if I'll have much luck when a node gets stuck (from my experience the JVM is too busy stuck to do any real work, including answering JMX queries). Will report back. > Nodes entering GC loop, does not recover > > > Key: CASSANDRA-13365 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13365 > Project: Cassandra > Issue Type: Bug > Environment: 34-node cluster over 4 DCs > Linux CentOS 7.2 x86 > Mix of 64GB/128GB RAM / node > Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz > High read volume, low write volume, occasional sstable bulk loading >Reporter: Mina Naguib > > Over the last week we've been observing two related problems affecting our > Cassandra cluster > Problem 1: 1-few nodes per DC entering GC loop, not recovering > Checking the heap usage stats, there's a sudden jump of 1-3GB. Some nodes > recover, but some don't and log this: > {noformat} > 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure) > 13G->11G(14G), 29.4127307 secs] > 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure) > 13G->12G(14G), 28.1561881 secs] > 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7019501 secs] > 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1372267 secs] > 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0703975 secs] > 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3187768 secs] > 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure) > 13G->13G(14G), 26.9493405 secs] > 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5222085 secs] > 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1769581 secs] > 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4639031 secs] > 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure) > 13G->13G(14G), 30.3480523 secs] > 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8220513 secs] > 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4691271 secs] > 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0275733 secs] > 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1902627 secs] > 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8872850 secs] > 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1609706 secs] > 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3635138 secs] > 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4143510 secs] > 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8595986 secs] > 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5242543 secs] > 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure) > 13G->13G(14G), 30.8930130 secs] > 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure) > 13G->13G(14G), 27.6523013 secs] > 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3030198 secs] > 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1594154 secs] > 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3766949 secs] > 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure) > 13G->13G(14G),
[jira] [Commented] (CASSANDRA-13365) Nodes entering GC loop, does not recover
[ https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975075#comment-15975075 ] Jeff Jirsa commented on CASSANDRA-13365: {code} 9: 2870889 160769784 org.apache.cassandra.transport.messages.ResultMessage$Rows 10: 2937336 140992128 io.netty.buffer.SlicedAbstractByteBuf 11: 8854 118773984 [Lio.netty.util.Recycler$DefaultHandle; 12: 2830805 113232200 org.apache.cassandra.db.rows.BufferCell 13: 2937336 93994752 org.apache.cassandra.transport.Frame$Header 14: 2870928 91869696 org.apache.cassandra.cql3.ResultSet$ResultMetadata 15: 2728627 87316064 org.apache.cassandra.db.rows.BTreeRow {code} 2.8M ResultMessage$Rows, BufferCells, ResultSets, and BTreeRows suggests you have an awful lot of read results in flight, and you've filled the heap with them. Is it possible someone is doing a query with a very, very large LIMIT rather than using driver's fetchSize() or manual paging? Or do you have concurrent read threads turned up very high? > Nodes entering GC loop, does not recover > > > Key: CASSANDRA-13365 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13365 > Project: Cassandra > Issue Type: Bug > Environment: 34-node cluster over 4 DCs > Linux CentOS 7.2 x86 > Mix of 64GB/128GB RAM / node > Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz > High read volume, low write volume, occasional sstable bulk loading >Reporter: Mina Naguib > > Over the last week we've been observing two related problems affecting our > Cassandra cluster > Problem 1: 1-few nodes per DC entering GC loop, not recovering > Checking the heap usage stats, there's a sudden jump of 1-3GB. Some nodes > recover, but some don't and log this: > {noformat} > 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure) > 13G->11G(14G), 29.4127307 secs] > 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure) > 13G->12G(14G), 28.1561881 secs] > 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7019501 secs] > 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1372267 secs] > 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0703975 secs] > 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3187768 secs] > 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure) > 13G->13G(14G), 26.9493405 secs] > 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5222085 secs] > 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1769581 secs] > 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4639031 secs] > 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure) > 13G->13G(14G), 30.3480523 secs] > 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8220513 secs] > 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4691271 secs] > 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0275733 secs] > 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1902627 secs] > 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8872850 secs] > 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1609706 secs] > 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3635138 secs] > 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4143510 secs] > 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8595986 secs] > 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5242543 secs] > 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure) > 13G->13G(14G), 30.8930130 secs] > 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure) > 13G->13G(14G), 27.6523013 secs] > 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3030198 secs] > 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1594154 secs] > 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3766949 secs] > 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5756770 secs] > 2017-03-21T11:36:13.816-0400: 54890.378: [Full GC
[jira] [Commented] (CASSANDRA-13365) Nodes entering GC loop, does not recover
[ https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975031#comment-15975031 ] ZhaoYang commented on CASSANDRA-13365: -- [~minaguib] could you share what queries are running at that moment? or you could try `sjk-plus` to see which thread is allocating huge memory > Nodes entering GC loop, does not recover > > > Key: CASSANDRA-13365 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13365 > Project: Cassandra > Issue Type: Bug > Environment: 34-node cluster over 4 DCs > Linux CentOS 7.2 x86 > Mix of 64GB/128GB RAM / node > Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz > High read volume, low write volume, occasional sstable bulk loading >Reporter: Mina Naguib > > Over the last week we've been observing two related problems affecting our > Cassandra cluster > Problem 1: 1-few nodes per DC entering GC loop, not recovering > Checking the heap usage stats, there's a sudden jump of 1-3GB. Some nodes > recover, but some don't and log this: > {noformat} > 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure) > 13G->11G(14G), 29.4127307 secs] > 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure) > 13G->12G(14G), 28.1561881 secs] > 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7019501 secs] > 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1372267 secs] > 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0703975 secs] > 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3187768 secs] > 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure) > 13G->13G(14G), 26.9493405 secs] > 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5222085 secs] > 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1769581 secs] > 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4639031 secs] > 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure) > 13G->13G(14G), 30.3480523 secs] > 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8220513 secs] > 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4691271 secs] > 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0275733 secs] > 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1902627 secs] > 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8872850 secs] > 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1609706 secs] > 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3635138 secs] > 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4143510 secs] > 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8595986 secs] > 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5242543 secs] > 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure) > 13G->13G(14G), 30.8930130 secs] > 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure) > 13G->13G(14G), 27.6523013 secs] > 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3030198 secs] > 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1594154 secs] > 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3766949 secs] > 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5756770 secs] > 2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5541813 secs] > 2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure) > 13G->13G(14G), 33.7510103 secs] > 2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4856611 secs] > 2017-03-21T11:37:44.454-0400: 54981.017: [Full GC (Allocation Failure) > 13G->13G(14G), 28.1269335 secs] > 2017-03-21T11:38:12.774-0400: 55009.337: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7830448 secs] > 2017-03-21T11:38:40.840-0400: 55037.402: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3527326 secs] > 2017-03-21T11:39:08.610-0400: 55065.173: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5828941 secs] > 2017-03-21T11:39:36.833-0400: 55093.396: [Full GC (Allocation Failure) >
[jira] [Commented] (CASSANDRA-13365) Nodes entering GC loop, does not recover
[ https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942116#comment-15942116 ] Jeremy Hanna commented on CASSANDRA-13365: -- It sounds like you're churning data in the heap for some reason, probably use case and/or data model related. While using G1 and increasing your heap size will help in the short term, the long term solution is to review what is in your heap by doing a heap dump analysis and reviewing your data model and access patterns. It sounds like you were able to get a heap dump that was 20G. Typically what I've done is just get a windows cloud instance with a large amount of memory and copy it there and use eclipse heap analyzer to try to track down what's going on. I would be surprised if this were a bug though or anything to do with Cassandra itself. > Nodes entering GC loop, does not recover > > > Key: CASSANDRA-13365 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13365 > Project: Cassandra > Issue Type: Bug > Environment: 34-node cluster over 4 DCs > Linux CentOS 7.2 x86 > Mix of 64GB/128GB RAM / node > Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz > High read volume, low write volume, occasional sstable bulk loading >Reporter: Mina Naguib > > Over the last week we've been observing two related problems affecting our > Cassandra cluster > Problem 1: 1-few nodes per DC entering GC loop, not recovering > Checking the heap usage stats, there's a sudden jump of 1-3GB. Some nodes > recover, but some don't and log this: > {noformat} > 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure) > 13G->11G(14G), 29.4127307 secs] > 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure) > 13G->12G(14G), 28.1561881 secs] > 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7019501 secs] > 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1372267 secs] > 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0703975 secs] > 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3187768 secs] > 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure) > 13G->13G(14G), 26.9493405 secs] > 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5222085 secs] > 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1769581 secs] > 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4639031 secs] > 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure) > 13G->13G(14G), 30.3480523 secs] > 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8220513 secs] > 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4691271 secs] > 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0275733 secs] > 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1902627 secs] > 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8872850 secs] > 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1609706 secs] > 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3635138 secs] > 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4143510 secs] > 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8595986 secs] > 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5242543 secs] > 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure) > 13G->13G(14G), 30.8930130 secs] > 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure) > 13G->13G(14G), 27.6523013 secs] > 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3030198 secs] > 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1594154 secs] > 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3766949 secs] > 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5756770 secs] > 2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5541813 secs] > 2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure) > 13G->13G(14G), 33.7510103 secs] > 2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4856611
[jira] [Commented] (CASSANDRA-13365) Nodes entering GC loop, does not recover
[ https://issues.apache.org/jira/browse/CASSANDRA-13365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936377#comment-15936377 ] Mina Naguib commented on CASSANDRA-13365: - For the time being, we've mitigated by adding a script on each node that detects it's entered a GC loop (at least 5 logged "Full GC (Allocation Failure)" errors in 5 minutes) and force-restarts the node. I'd appreciate any ideas for further investigation towards a proper fix. > Nodes entering GC loop, does not recover > > > Key: CASSANDRA-13365 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13365 > Project: Cassandra > Issue Type: Bug > Environment: 34-node cluster over 4 DCs > Linux CentOS 7.2 x86 > Mix of 64GB/128GB RAM / node > Mix of 32/40 hardware threads / node, Xeon ~2.4Ghz > High read volume, low write volume, occasional sstable bulk loading >Reporter: Mina Naguib > > Over the last week we've been observing two related problems affecting our > Cassandra cluster > Problem 1: 1-few nodes per DC entering GC loop, not recovering > Checking the heap usage stats, there's a sudden jump of 1-3GB. Some nodes > recover, but some don't and log this: > {noformat} > 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure) > 13G->11G(14G), 29.4127307 secs] > 2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure) > 13G->12G(14G), 28.1561881 secs] > 2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7019501 secs] > 2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1372267 secs] > 2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0703975 secs] > 2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3187768 secs] > 2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure) > 13G->13G(14G), 26.9493405 secs] > 2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5222085 secs] > 2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1769581 secs] > 2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4639031 secs] > 2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure) > 13G->13G(14G), 30.3480523 secs] > 2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8220513 secs] > 2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4691271 secs] > 2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.0275733 secs] > 2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1902627 secs] > 2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8872850 secs] > 2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1609706 secs] > 2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3635138 secs] > 2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4143510 secs] > 2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure) > 13G->13G(14G), 27.8595986 secs] > 2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5242543 secs] > 2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure) > 13G->13G(14G), 30.8930130 secs] > 2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure) > 13G->13G(14G), 27.6523013 secs] > 2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3030198 secs] > 2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure) > 13G->13G(14G), 27.1594154 secs] > 2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3766949 secs] > 2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5756770 secs] > 2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure) > 13G->13G(14G), 27.5541813 secs] > 2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure) > 13G->13G(14G), 33.7510103 secs] > 2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure) > 13G->13G(14G), 27.4856611 secs] > 2017-03-21T11:37:44.454-0400: 54981.017: [Full GC (Allocation Failure) > 13G->13G(14G), 28.1269335 secs] > 2017-03-21T11:38:12.774-0400: 55009.337: [Full GC (Allocation Failure) > 13G->13G(14G), 27.7830448 secs] > 2017-03-21T11:38:40.840-0400: 55037.402: [Full GC (Allocation Failure) > 13G->13G(14G), 27.3527326 secs] > 2017-03-21T11:39:08.610-0400: 55065.173: [Full GC