[ https://issues.apache.org/jira/browse/CASSANDRA-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607982#comment-14607982 ]
Benedict commented on CASSANDRA-9681:
-------------------------------------
Those logs are much more helpful. They suggest the heap dump will not be as
useful as hoped - if it's nearly ready, please do upload it, but if producing
it is degrading your cluster or causing you problems, feel free not to. Thread
dumps taken at 10-minute intervals might be helpful, however.
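In case it's useful, here's a rough sketch (not an official tool, and untested
against your setup) of one way to capture those dumps over JMX; it assumes
Cassandra's default JMX port 7199 on localhost with authentication disabled.
Running {{jstack <cassandra-pid>}} from cron every 10 minutes works just as well.
{code}
// Rough sketch: periodically dump the threads of a JVM reachable over JMX.
// Assumes the default Cassandra JMX port (7199) on localhost and no JMX auth.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Date;
import java.util.concurrent.TimeUnit;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PeriodicThreadDump
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            for (int i = 0; i < 6; i++)
            {
                System.out.println("==== thread dump " + i + " at " + new Date() + " ====");
                // ThreadInfo.toString() truncates very deep stacks; jstack gives full ones
                for (ThreadInfo info : threads.dumpAllThreads(true, true))
                    System.out.print(info);
                TimeUnit.MINUTES.sleep(10); // space the dumps 10 minutes apart
            }
        }
        finally
        {
            connector.close();
        }
    }
}
{code}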
{code}
system.log.2015-06-29_1419:INFO [SlabPoolCleaner] 2015-06-29 08:47:56,322 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 2469606408 (118%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO [SlabPoolCleaner] 2015-06-29 08:54:09,873 ColumnFamilyStore.java:906 - Enqueuing flush of entity_by_id: 2633134676 (126%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO [SlabPoolCleaner] 2015-06-29 09:00:21,353 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 4191587102 (200%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO [SlabPoolCleaner] 2015-06-29 09:10:39,736 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 3455005866 (165%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO [SlabPoolCleaner] 2015-06-29 09:19:18,950 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 2879879873 (138%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO [SlabPoolCleaner] 2015-06-29 09:26:28,874 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 2400355359 (115%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO [SlabPoolCleaner] 2015-06-29 09:32:16,612 ColumnFamilyStore.java:906 - Enqueuing flush of entity_by_id: 2630156881 (126%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO [SlabPoolCleaner] 2015-06-29 09:38:35,445 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 4232454175 (202%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO [SlabPoolCleaner] 2015-06-29 09:47:56,679 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 3475375055 (166%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO [SlabPoolCleaner] 2015-06-29 09:56:44,672 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 3046517722 (146%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO [SlabPoolCleaner] 2015-06-29 10:04:09,950 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 2522479792 (121%) on-heap, 0 (0%) off-heap
{code}
So, we can definitely see that things are messed up, and that Cassandra is
aware of it. The real question is why it is permitting this to occur, and more
importantly why that has changed. There is only one reason we ostensibly permit
overshooting the memtable heap limit: to let writes that were already in
progress when we decided to flush run to completion. Since that overshoot
should be bounded by the footprint of those in-flight writes, it doesn't
adequately explain the massive overshoot seen here, and this behaviour hasn't
changed between these versions.
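To make that reasoning concrete, here's a toy illustration (emphatically not
Cassandra's actual allocator code - the names and numbers are invented) of the
only sanctioned overshoot, i.e. writes that had already reserved memory when
the flush was triggered:
{code}
// Toy model only - not Cassandra's MemtablePool/allocator code. It illustrates
// why overshoot of the on-heap limit should be bounded by in-flight writes.
import java.util.concurrent.atomic.AtomicLong;

public class OvershootSketch
{
    static final long LIMIT = 100;                  // pretend on-heap limit, arbitrary units
    static final AtomicLong onHeap = new AtomicLong();

    // Each write reserves its memory up front; once reserved it is allowed to
    // finish even if the pool has crossed the limit in the meantime.
    static void write(long size)
    {
        long used = onHeap.addAndGet(size);
        if (used > LIMIT)
            System.out.printf("enqueuing flush: %d (%d%%) on-heap%n", used, used * 100 / LIMIT);
    }

    public static void main(String[] args)
    {
        // Only writes already in progress when the flush decision is made can
        // push usage over the limit, so ~200% utilisation is surprising unless
        // individual writes (or something retaining memory) are very large.
        for (long size : new long[]{ 40, 40, 30, 10 })
            write(size);
    }
}
{code}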
Do you by any chance use secondary indexes? These are a potential (but
unlikely) source of large writes that would have permission to exceed the
memtable heap limit, if they were very stale.
Could you upload your schema, and let us know what compaction strategy you're
using?
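For the schema, the output of {{DESCRIBE KEYSPACE <keyspace>}} from cqlsh is
ideal, since it includes each table's compaction strategy in its options. If
it's easier, something along these lines with the DataStax Java driver produces
the same thing - a rough sketch, assuming driver 2.x, a contact point of
127.0.0.1, and a placeholder keyspace name:
{code}
// Rough sketch using the DataStax Java driver (2.x) to export a keyspace's
// schema as CQL; "my_keyspace" and the contact point are placeholders.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.KeyspaceMetadata;

public class ExportSchema
{
    public static void main(String[] args)
    {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        try
        {
            KeyspaceMetadata ks = cluster.getMetadata().getKeyspace("my_keyspace");
            // exportAsString() emits the CREATE statements, including each
            // table's compaction strategy and other options
            System.out.println(ks.exportAsString());
        }
        finally
        {
            cluster.close();
        }
    }
}
{code}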
I think it will likely be necessary to release a patched version for you with
some improved logging to help diagnose the problem.
> Memtable heap size grows and many long GC pauses are triggered
> --------------------------------------------------------------
>
> Key: CASSANDRA-9681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9681
> Project: Cassandra
> Issue Type: Bug
> Environment: C* 2.1.7, Debian Wheezy
> Reporter: mlowicki
> Assignee: Benedict
> Priority: Critical
> Fix For: 2.1.x
>
> Attachments: cassandra.yaml, db5.system.log, db5.system.log.1.zip,
> db5.system.log.2.zip, db5.system.log.3.zip, system.log.6.zip,
> system.log.7.zip, system.log.8.zip, system.log.9.zip
>
>
> C* 2.1.7 cluster is behaving really badly after 1-2 days.
> {{gauges.cassandra.jmx.org.apache.cassandra.metrics.ColumnFamily.AllMemtablesHeapSize.Value}}
> jumps to 7 GB
> (https://www.dropbox.com/s/vraggy292erkzd2/Screenshot%202015-06-29%2019.12.53.png?dl=0)
> on 3/6 nodes in each data center and then there are many long GC pauses.
> Cluster is using default heap size values ({{-Xms8192M -Xmx8192M -Xmn2048M}}).
> Before C* 2.1.5, memtable heap size was basically constant at ~500MB
> (https://www.dropbox.com/s/fjdywik5lojstvn/Screenshot%202015-06-29%2019.30.00.png?dl=0)
> After restarting all nodes it behaves stably for 1-2 days. Today I've done
> that and the long GC pauses are gone (~18:00
> https://www.dropbox.com/s/7vo3ynz505rsfq3/Screenshot%202015-06-29%2019.28.37.png?dl=0).
> The only pattern we've found so far is that long GC pauses are happening
> basically at the same time on all nodes in the same data center - even on the
> ones where memtable heap size is not growing.
> Cliffs on the graphs are nodes restarts.
> Used memory on boxes where {{AllMemtablesHeapSize}} grows stays at the same
> level -
> https://www.dropbox.com/s/tes9abykixs86rf/Screenshot%202015-06-29%2019.37.52.png?dl=0.
> Replication factor is set to 3.