[
https://issues.apache.org/jira/browse/CASSANDRA-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606170#comment-14606170
]
Benedict commented on CASSANDRA-9681:
-------------------------------------
So, for posterity, I ran the following bash script for analysing the logs:
{code}
grep -E "Completed flushing|Enqueuing flush of ([^:]+): [0-9]+ \(([0-9]+)%\)"
system.log.2* | grep -v compactions_in_progress | sed -r "s@.* - (.*)@\1@" |
sed -r "s@Completed flushing .*-([^-]+)-ka-[0-9]+-Data.db.*@completed \1@" |
sed -r 's@Enqueuing flush of ([^ :]+): [0-9]+ \(([0-9]+)%.*@started \1 \2@' |
awk '{ if ($1 == "started") { total[$2] += $3; list[$2][end[$2]] = $3;
end[$2]++; } else { total[$2] -= list[$2][start[$2]]; delete
list[$2][start[$2]]; start[$2]++; } print("total:" total[$2] " " $0); }' | sort
| less
{code}
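As a quick follow-up (a sketch only, assuming gawk and the {{total:N started|completed <table>}} line format the script prints), the final {{sort | less}} can be swapped for a stage that reduces the stream to the peak concurrent flush total per table, so any breach of the limit stands out at a glance:
{code}
# Report the highest running flush-percentage total observed per table.
# $1 is "total:N", $3 is the table name in both "started" and "completed" lines.
awk '{
  n = $1; sub(/^total:/, "", n);
  if (n + 0 > peak[$3]) peak[$3] = n + 0;
}
END { for (t in peak) print t, peak[t]; }'
{code}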
The script's output indicates flushing is happening as expected and staying well
within the bounds that are supposed to be enforced. These same numbers feed into
the ones reported via JMX; in fact, they should be strictly greater than the
values JMX returns, since JMX only reports the live memtables. So the numbers
suggesting you're exceeding your memtable space limits are hard to explain.
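The live gauge can also be sampled directly for comparison. A minimal sketch, assuming jmxterm is on hand as {{jmxterm.jar}} and that the aggregate gauge is exposed under the metric name quoted in the report (the exact ObjectName may vary by version):
{code}
# Read the live AllMemtablesHeapSize gauge over JMX (default JMX port 7199).
echo "get -b org.apache.cassandra.metrics:type=ColumnFamily,name=AllMemtablesHeapSize Value" |
  java -jar jmxterm.jar -l localhost:7199 -n
{code}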
The heap dump will no doubt help a great deal.
> Memtable heap size grows and many long GC pauses are triggered
> --------------------------------------------------------------
>
> Key: CASSANDRA-9681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9681
> Project: Cassandra
> Issue Type: Bug
> Environment: C* 2.1.7, Debian Wheezy
> Reporter: mlowicki
> Assignee: Benedict
> Priority: Critical
> Fix For: 2.1.x
>
> Attachments: cassandra.yaml, system.log.6.zip, system.log.7.zip,
> system.log.8.zip, system.log.9.zip
>
>
> The C* 2.1.7 cluster starts behaving really badly after 1-2 days.
> {{gauges.cassandra.jmx.org.apache.cassandra.metrics.ColumnFamily.AllMemtablesHeapSize.Value}}
> jumps to 7 GB
> (https://www.dropbox.com/s/vraggy292erkzd2/Screenshot%202015-06-29%2019.12.53.png?dl=0)
> on 3 of 6 nodes in each data center, and then there are many long GC pauses.
> The cluster is using the default heap size values ({{-Xms8192M -Xmx8192M -Xmn2048M}}).
> Before C* 2.1.5, the memtable heap size was basically constant at ~500 MB
> (https://www.dropbox.com/s/fjdywik5lojstvn/Screenshot%202015-06-29%2019.30.00.png?dl=0).
> After restarting all nodes, the cluster behaves stably for 1-2 days. Today I did
> that and the long GC pauses are gone (~18:00,
> https://www.dropbox.com/s/7vo3ynz505rsfq3/Screenshot%202015-06-29%2019.28.37.png?dl=0).
> The only pattern we've found so far is that the long GC pauses happen at
> basically the same time on all nodes in the same data center - even on the
> ones where the memtable heap size is not growing.
> Cliffs on the graphs are node restarts.
> Used memory on boxes where {{AllMemtablesHeapSize}} grows stays at the same
> level -
> https://www.dropbox.com/s/tes9abykixs86rf/Screenshot%202015-06-29%2019.37.52.png?dl=0.
> Replication factor is set to 3.