[ 
https://issues.apache.org/jira/browse/CASSANDRA-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608239#comment-14608239
 ] 

Benedict commented on CASSANDRA-9681:
-------------------------------------

I suspect I have found the issue.

Under some circumstances, modifying a memtable's contents can reduce the 
amount of memory it uses (typically it only grows, or stays stable). When 
this happens, we corrupt the bookkeeping, as a result of some suboptimal 
choices in the API. The error introduced each time is likely to be very 
small, but it accumulates over time.
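
To illustrate how a small per-update error in memory bookkeeping can 
accumulate, here is a minimal Java sketch. It is not Cassandra's actual 
allocator code: the class {{DriftSketch}}, the method {{naiveUpdate}}, and 
the idea that shrinkage is simply ignored are all assumptions made for 
illustration, since the comment does not describe the exact mechanism.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical accounting sketch; names and logic are illustrative
// assumptions, not Cassandra's real memtable allocator.
public class DriftSketch {
    // Bytes the (hypothetical) pool believes memtables own.
    static final AtomicLong owned = new AtomicLong();

    // Naive API: assumes an update never makes the data smaller,
    // so only positive deltas are recorded.
    static void naiveUpdate(long oldSize, long newSize) {
        long delta = newSize - oldSize;
        if (delta > 0) {
            owned.addAndGet(delta);
        }
        // A negative delta is silently dropped here, so 'owned'
        // ends up overstating the real usage.
    }

    // Runs 'updates' modifications and returns recorded minus actual bytes.
    static long simulate(int updates) {
        owned.set(0);
        long actual = 0;
        for (int i = 0; i < updates; i++) {
            long before = actual;
            // Alternate: grow by 16 bytes, then shrink by 8 bytes.
            actual += (i % 2 == 0) ? 16 : -8;
            naiveUpdate(before, actual);
        }
        return owned.get() - actual;
    }

    public static void main(String[] args) {
        System.out.println("drift after 1000 updates: " + simulate(1000));
    }
}
```

Each shrinking update in this sketch leaves 8 bytes unaccounted for, so the 
gap between recorded and actual usage grows with the number of updates even 
though every individual error is tiny.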

I have uploaded a patch 
[here|https://github.com/belliottsmith/cassandra/tree/9681], and I'm just 
waiting on CI results to confirm it doesn't break anything. It improves the 
API to avoid this problem, fails earlier if the API is misused (although it 
should now be robust to misuse), and logs more useful information so that 
this kind of issue is easier to spot. I will also follow up with specific 
regression tests.
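
As a rough illustration of what an API that tolerates shrinkage and fails 
early on misuse might look like, here is a minimal Java sketch. The class 
{{SignedLedger}} and its methods are hypothetical and are not taken from the 
patch; the sketch only assumes that a single signed adjustment replaces 
separate grow-only calls, and that an impossible total is rejected at once.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical ledger; an illustrative assumption, not the patch's real API.
public class SignedLedger {
    private final AtomicLong owned = new AtomicLong();

    // One entry point for both growth (delta > 0) and shrinkage (delta < 0).
    public void adjust(long delta) {
        long updated = owned.addAndGet(delta);
        if (updated < 0) {
            // Fail fast: a negative total means a caller misreported sizes.
            owned.addAndGet(-delta); // undo so the ledger stays usable
            throw new IllegalStateException(
                "ledger would go negative: delta=" + delta
                + ", owned before=" + (updated - delta));
        }
    }

    public long owned() {
        return owned.get();
    }

    public static void main(String[] args) {
        SignedLedger ledger = new SignedLedger();
        ledger.adjust(16);  // an update grows the data
        ledger.adjust(-8);  // a later update shrinks it
        System.out.println("owned: " + ledger.owned());
    }
}
```

Including the offending delta and the previous total in the exception message 
is one way to make this kind of misuse easier to spot in logs.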

I'll let you know when the patch is ready to trial. If you could confirm it 
fixes your issue, that would be greatly appreciated.

> Memtable heap size grows and many long GC pauses are triggered
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-9681
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9681
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: C* 2.1.7, Debian Wheezy
>            Reporter: mlowicki
>            Assignee: Benedict
>            Priority: Critical
>             Fix For: 2.1.x
>
>         Attachments: cassandra.yaml, db5.system.log, db5.system.log.1.zip, 
> db5.system.log.2.zip, db5.system.log.3.zip, schema.cql, system.log.6.zip, 
> system.log.7.zip, system.log.8.zip, system.log.9.zip
>
>
> The C* 2.1.7 cluster is behaving really badly after 1-2 days. 
> {{gauges.cassandra.jmx.org.apache.cassandra.metrics.ColumnFamily.AllMemtablesHeapSize.Value}}
>  jumps to 7 GB 
> (https://www.dropbox.com/s/vraggy292erkzd2/Screenshot%202015-06-29%2019.12.53.png?dl=0)
>  on 3/6 nodes in each data center and then there are many long GC pauses. 
> The cluster is using default heap size values ({{-Xms8192M -Xmx8192M -Xmn2048M}}).
> Before C* 2.1.5, memtable heap size was basically constant at ~500MB 
> (https://www.dropbox.com/s/fjdywik5lojstvn/Screenshot%202015-06-29%2019.30.00.png?dl=0)
> After restarting all nodes, it behaves stably for 1-2 days. Today I've done 
> that and the long GC pauses are gone (~18:00 
> https://www.dropbox.com/s/7vo3ynz505rsfq3/Screenshot%202015-06-29%2019.28.37.png?dl=0).
>  The only pattern we've found so far is that long GC pauses are happening 
> basically at the same time on all nodes in the same data center - even on the 
> ones where memtables heap size is not growing.
> Cliffs on the graphs are node restarts.
> Used memory on boxes where {{AllMemtablesHeapSize}} grows stays at the same 
> level - 
> https://www.dropbox.com/s/tes9abykixs86rf/Screenshot%202015-06-29%2019.37.52.png?dl=0.
> Replication factor is set to 3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
