[
https://issues.apache.org/jira/browse/CASSANDRA-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608239#comment-14608239
]
Benedict commented on CASSANDRA-9681:
-------------------------------------
I suspect I have found the issue.
Under some circumstances, modifying a memtable's contents can reduce the
amount of memory it uses (typically it only grows or stays stable). When
this happens, we corrupt the bookkeeping, as a result of some suboptimal
choices in the API. The amount of corruption is likely to be very small,
but it accumulates over time.
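To make the failure mode concrete, here is a minimal sketch of how a growth-only accounting API can drift when a modification shrinks an entry. This is a hypothetical illustration, not the actual Cassandra allocator code: the names {{Pool}}, {{Owner}}, {{adjustBuggy}} and {{adjustChecked}} are invented for the example, and the real bug may differ in its details.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration only; not Cassandra's actual allocator code. */
final class AccountingSketch {

    /** Shared pool: tracks bytes reserved across all owners. */
    static final class Pool {
        final AtomicLong reserved = new AtomicLong();

        /** Written with growth in mind: silently ignores non-positive sizes. */
        void acquire(long size) {
            if (size > 0) reserved.addAndGet(size);
        }

        void release(long size) {
            reserved.addAndGet(-size);
        }
    }

    /** Per-memtable owner: tracks what it believes it holds from the pool. */
    static final class Owner {
        final Pool pool;
        long owned;

        Owner(Pool pool) { this.pool = pool; }

        /** Buggy path: the owner applies any delta, the pool only sees growth. */
        void adjustBuggy(long delta) {
            owned += delta;
            pool.acquire(delta);
        }

        /** Checked path: a negative delta is routed to release(). */
        void adjustChecked(long delta) {
            owned += delta;
            if (delta >= 0) pool.acquire(delta);
            else pool.release(-delta);
        }
    }

    /** Returns pool.reserved minus owner.owned after applying all deltas. */
    static long drift(long[] deltas, boolean checked) {
        Pool pool = new Pool();
        Owner owner = new Owner(pool);
        for (long d : deltas) {
            if (checked) owner.adjustChecked(d);
            else owner.adjustBuggy(d);
        }
        return pool.reserved.get() - owner.owned;
    }

    public static void main(String[] args) {
        long[] deltas = {100, -40, 100, -40};
        System.out.println("buggy drift:   " + drift(deltas, false));
        System.out.println("checked drift: " + drift(deltas, true));
    }
}
```

In this toy model each shrinking update leaves the pool overstated by the size of the shrink, so a small per-update error adds up to a large discrepancy over many updates.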
I have uploaded a patch
[here|https://github.com/belliottsmith/cassandra/tree/9681]; I'm just
waiting on CI results to confirm it doesn't break anything. The patch improves
the API to avoid this problem, fails earlier if the API is misused (although it
should now be robust to misuse), and logs more useful information so that this
kind of issue is easier to spot. I will also follow up with specific regression
tests.
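As a rough idea of what "fails earlier if the API is misused" can look like, the sketch below validates every accounting call and reports the offending values in the exception message. It is an assumption-laden illustration, not the code in the patch: {{FailFastTracker}}, {{allocated}} and {{released}} are hypothetical names chosen for this example.

```java
/** Hypothetical illustration only; not the code from the patch. */
final class FailFastTracker {
    private long owned;

    /** Records growth; rejects negative sizes instead of silently accepting them. */
    void allocated(long size) {
        if (size < 0)
            throw new IllegalArgumentException("allocated() requires size >= 0, got " + size);
        owned += size;
    }

    /** Records shrinkage; rejects releases that exceed what is currently owned. */
    void released(long size) {
        if (size < 0 || size > owned)
            throw new IllegalStateException(
                "released " + size + " bytes but only " + owned + " are owned");
        owned -= size;
    }

    long owned() {
        return owned;
    }

    public static void main(String[] args) {
        FailFastTracker tracker = new FailFastTracker();
        tracker.allocated(100);
        tracker.released(40);
        System.out.println("owned: " + tracker.owned());
        try {
            tracker.released(100); // more than is owned: fails immediately
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing at the call site keeps a bad delta from being absorbed into the running totals, where it would only surface much later as unexplained memory growth.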
I'll let you know when the patch is ready to trial. If you could confirm it
fixes your issue, that would be greatly appreciated.
> Memtable heap size grows and many long GC pauses are triggered
> --------------------------------------------------------------
>
> Key: CASSANDRA-9681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9681
> Project: Cassandra
> Issue Type: Bug
> Environment: C* 2.1.7, Debian Wheezy
> Reporter: mlowicki
> Assignee: Benedict
> Priority: Critical
> Fix For: 2.1.x
>
> Attachments: cassandra.yaml, db5.system.log, db5.system.log.1.zip,
> db5.system.log.2.zip, db5.system.log.3.zip, schema.cql, system.log.6.zip,
> system.log.7.zip, system.log.8.zip, system.log.9.zip
>
>
> The C* 2.1.7 cluster behaves really badly after 1-2 days.
> {{gauges.cassandra.jmx.org.apache.cassandra.metrics.ColumnFamily.AllMemtablesHeapSize.Value}}
> jumps to 7 GB
> (https://www.dropbox.com/s/vraggy292erkzd2/Screenshot%202015-06-29%2019.12.53.png?dl=0)
> on 3/6 nodes in each data center and then there are many long GC pauses.
> The cluster is using the default heap size values ({{-Xms8192M -Xmx8192M -Xmn2048M}}).
> Before C* 2.1.5, memtable heap size was basically constant at ~500MB
> (https://www.dropbox.com/s/fjdywik5lojstvn/Screenshot%202015-06-29%2019.30.00.png?dl=0)
> After restarting all nodes, it behaves stably for 1-2 days. Today I've done
> that and long GC pauses are gone (~18:00
> https://www.dropbox.com/s/7vo3ynz505rsfq3/Screenshot%202015-06-29%2019.28.37.png?dl=0).
> The only pattern we've found so far is that long GC pauses are happening
> basically at the same time on all nodes in the same data center - even on the
> ones where memtables heap size is not growing.
> Cliffs on the graphs are node restarts.
> Used memory on boxes where {{AllMemtablesHeapSize}} grows stays at the same
> level -
> https://www.dropbox.com/s/tes9abykixs86rf/Screenshot%202015-06-29%2019.37.52.png?dl=0.
> Replication factor is set to 3.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)