[ https://issues.apache.org/jira/browse/CASSANDRA-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607982#comment-14607982 ]

Benedict commented on CASSANDRA-9681:
-------------------------------------

Those logs are much more helpful. They suggest the heap dump will not be as 
useful as hoped - if it's nearly ready, please do upload it, but if producing it 
is degrading your cluster or causing you problems, please feel free not to. 
Thread dumps taken at roughly ten-minute intervals might be helpful, however.
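
To show the cadence and content that would be useful, here is a purely 
illustrative standalone loop (the class name is made up; it only sees the 
threads of the JVM it runs in, so in practice running {{jstack <pid>}} against 
the Cassandra process from cron every ten minutes is the simplest route):

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative only: dump name, state and full stack of every thread in this
// JVM every ten minutes. `jstack <pid>` gives the same information for an
// external process and is the practical way to collect these.
public class PeriodicThreadDump
{
    public static void main(String[] args) throws InterruptedException
    {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        while (true)
        {
            for (ThreadInfo info : threads.dumpAllThreads(true, true))
            {
                System.out.println('"' + info.getThreadName() + "\" - " + info.getThreadState());
                for (StackTraceElement frame : info.getStackTrace())
                    System.out.println("    at " + frame);
            }
            Thread.sleep(10 * 60 * 1000L); // ten-minute interval
        }
    }
}
{code}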

{code}
system.log.2015-06-29_1419:INFO  [SlabPoolCleaner] 2015-06-29 08:47:56,322 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 2469606408 (118%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO  [SlabPoolCleaner] 2015-06-29 08:54:09,873 ColumnFamilyStore.java:906 - Enqueuing flush of entity_by_id: 2633134676 (126%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO  [SlabPoolCleaner] 2015-06-29 09:00:21,353 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 4191587102 (200%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO  [SlabPoolCleaner] 2015-06-29 09:10:39,736 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 3455005866 (165%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO  [SlabPoolCleaner] 2015-06-29 09:19:18,950 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 2879879873 (138%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO  [SlabPoolCleaner] 2015-06-29 09:26:28,874 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 2400355359 (115%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO  [SlabPoolCleaner] 2015-06-29 09:32:16,612 ColumnFamilyStore.java:906 - Enqueuing flush of entity_by_id: 2630156881 (126%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO  [SlabPoolCleaner] 2015-06-29 09:38:35,445 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 4232454175 (202%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO  [SlabPoolCleaner] 2015-06-29 09:47:56,679 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 3475375055 (166%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO  [SlabPoolCleaner] 2015-06-29 09:56:44,672 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 3046517722 (146%) on-heap, 0 (0%) off-heap
system.log.2015-06-29_1419:INFO  [SlabPoolCleaner] 2015-06-29 10:04:09,950 ColumnFamilyStore.java:906 - Enqueuing flush of entity: 2522479792 (121%) on-heap, 0 (0%) off-heap
{code}

So we can definitely see that things are messed up, and that Cassandra is 
aware of it. The real question is why it is permitting this to occur, and more 
importantly why that behaviour has changed. There is only one reason we 
ostensibly permit the memtable heap utilisation to be overshot: to allow writes 
that were already in progress when we decided to flush to complete. That 
doesn't adequately explain such a massive overshoot, and this behaviour hasn't 
changed between these versions.
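
To illustrate the intent, here is a minimal sketch with made-up names (it is 
not the actual MemtablePool/SlabPoolCleaner code): usage is accounted per pool, 
the cleaner enqueues a flush once usage crosses the cleanup threshold, and only 
writes admitted before that point should be able to push usage past the limit.

{code}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only - not Cassandra's actual implementation.
class MemtablePoolSketch
{
    private final long limitBytes;          // memtable_heap_space_in_mb, in bytes
    private final double cleanupThreshold;  // memtable_cleanup_threshold
    private final AtomicLong onHeapUsed = new AtomicLong();

    MemtablePoolSketch(long limitBytes, double cleanupThreshold)
    {
        this.limitBytes = limitBytes;
        this.cleanupThreshold = cleanupThreshold;
    }

    // Write paths that have already been admitted account their allocation here;
    // these are the only writes that should be able to overshoot the limit.
    void allocated(long bytes)
    {
        onHeapUsed.addAndGet(bytes);
    }

    // The cleaner thread polls this and enqueues a flush of the largest memtable
    // when it returns true (the "Enqueuing flush of ..." lines above); new writes
    // should then be throttled rather than admitted.
    boolean needsCleaning()
    {
        return onHeapUsed.get() > (long) (limitBytes * cleanupThreshold);
    }
}
{code}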

Do you by any chance use secondary indexes? They are a potential (but unlikely) 
source of large writes that would have permission to exceed the heap space 
limit if they were very stale.

Could you upload your schema, and let us know what compaction strategy you're 
using?

I think it will likely be necessary to release a patched version for you with 
some improved logging to help diagnose the problem.


> Memtable heap size grows and many long GC pauses are triggered
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-9681
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9681
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: C* 2.1.7, Debian Wheezy
>            Reporter: mlowicki
>            Assignee: Benedict
>            Priority: Critical
>             Fix For: 2.1.x
>
>         Attachments: cassandra.yaml, db5.system.log, db5.system.log.1.zip, 
> db5.system.log.2.zip, db5.system.log.3.zip, system.log.6.zip, 
> system.log.7.zip, system.log.8.zip, system.log.9.zip
>
>
> C* 2.1.7 cluster is behaving really badly after 1-2 days. 
> {{gauges.cassandra.jmx.org.apache.cassandra.metrics.ColumnFamily.AllMemtablesHeapSize.Value}}
>  jumps to 7 GB 
> (https://www.dropbox.com/s/vraggy292erkzd2/Screenshot%202015-06-29%2019.12.53.png?dl=0)
>  on 3/6 nodes in each data center and then there are many long GC pauses. 
> Cluster is using default heap size values ({{-Xms8192M -Xmx8192M -Xmn2048M}}).
> Before C* 2.1.5, memtable heap size was basically constant at ~500MB 
> (https://www.dropbox.com/s/fjdywik5lojstvn/Screenshot%202015-06-29%2019.30.00.png?dl=0)
> After restarting all nodes it behaves stably for 1-2 days. Today I've done 
> that and long GC pauses are gone (~18:00 
> https://www.dropbox.com/s/7vo3ynz505rsfq3/Screenshot%202015-06-29%2019.28.37.png?dl=0).
>  The only pattern we've found so far is that long GC pauses are happening 
> basically at the same time on all nodes in the same data center - even on the 
> ones where memtable heap size is not growing.
> Cliffs on the graphs are nodes restarts.
> Used memory on boxes where {{AllMemtablesHeapSize}} grows stays at the same 
> level - 
> https://www.dropbox.com/s/tes9abykixs86rf/Screenshot%202015-06-29%2019.37.52.png?dl=0.
> Replication factor is set to 3.


