[
https://issues.apache.org/jira/browse/CASSANDRA-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksey Yeschenko resolved CASSANDRA-6405.
------------------------------------------
Resolution: Fixed
Fix Version/s: 2.1 beta2
Reproduced In: 1.2.11, 1.1.7, 1.0.12 (was: 1.0.12, 1.1.7, 1.2.11)
CASSANDRA-6506 has been delayed until 3.0, but this issue is now actually
resolved in 2.1 by the combination of new memtable code and various counters++
commits (including, but not limited to, part of CASSANDRA-6506 and
CASSANDRA-6953).
> When making heavy use of counters, neighbor nodes occasionally enter spiral
> of constant memory consumption
> ---------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-6405
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6405
> Project: Cassandra
> Issue Type: Bug
> Environment: RF of 3, 15 nodes.
> Sun Java 7 (also occurred in OpenJDK 6, and Sun Java 6).
> Xmx of 8G.
> No row cache.
> Reporter: Jason Harvey
> Fix For: 2.1 beta2
>
> Attachments: threaddump.txt
>
>
> We're randomly running into an interesting issue on our ring. When making use
> of counters, we'll occasionally have 3 nodes (always neighbors) suddenly start
> filling up memory, CMSing, filling up again, and repeating. This pattern goes
> on for 5-20 minutes. Nearly all requests to the nodes time out during this
> period. Restarting one, two, or all three of the nodes does not resolve the
> spiral; after a restart the three nodes immediately start hogging memory again
> and CMSing constantly.
> When the issue resolves itself, all 3 nodes immediately get better. Sometimes
> it recurs in bursts: the nodes will be thrashing for 20 minutes, fine for 5,
> thrashing for another 20, and so on for a few cycles.
> Cassandra logs nothing unusual during this period, other than the constant
> dropped read requests and the constant CMS runs. I have analyzed the log files
> prior to multiple distinct instances of this issue and have found no preceding
> events associated with it.
> I have verified that our apps are not performing any unusual number or type
> of requests during this time.
> This behaviour occurred on 1.0.12, 1.1.7, and now on 1.2.11.
> The way I've narrowed this down to counters is a bit naive. It started
> happening when we began making use of counter columns, and went away after we
> rolled back our use of counter columns. I've repeated this attempted rollout
> on each version now, and it consistently rears its head every time. I should
> note this incident does _seem_ to happen less often on 1.2.11 than on the
> previous versions.
> This incident has been consistent across multiple different types of
> hardware, as well as major kernel version changes (2.6 all the way to 3.2).
> The OS is operating normally during the event.
> I managed to get an hprof dump when the issue was happening in the wild.
> Something notable in the class instance counts as reported by jhat. Here are
> the top 5 counts for this one node:
> {code}
> 5967846 instances of class org.apache.cassandra.db.CounterColumn
> 1247525 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
> 1247310 instances of class org.apache.cassandra.cache.KeyCacheKey
> 1246648 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
> 1237526 instances of class org.apache.cassandra.db.RowIndexEntry
> {code}
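> (Aside: a dump like the attached hprof can be captured from a node while it is
> thrashing either with jmap, or programmatically over Cassandra's JMX port via
> the HotSpot diagnostic MXBean. The sketch below is illustrative only; the
> host, port, and output path are placeholders, not how this particular dump was
> produced.)
> {code}
> // Illustrative only: trigger a live-objects heap dump on a Cassandra node over
> // JMX (Cassandra's default JMX port is 7199). The dump file is written on the
> // target node's filesystem.
> import java.lang.management.ManagementFactory;
> import javax.management.MBeanServerConnection;
> import javax.management.remote.JMXConnector;
> import javax.management.remote.JMXConnectorFactory;
> import javax.management.remote.JMXServiceURL;
> import com.sun.management.HotSpotDiagnosticMXBean;
>
> public class DumpHeap {
>     public static void main(String[] args) throws Exception {
>         JMXServiceURL url = new JMXServiceURL(
>                 "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
>         JMXConnector connector = JMXConnectorFactory.connect(url);
>         try {
>             MBeanServerConnection mbsc = connector.getMBeanServerConnection();
>             HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
>                     mbsc, "com.sun.management:type=HotSpotDiagnostic",
>                     HotSpotDiagnosticMXBean.class);
>             // 'true' = dump only live (reachable) objects, like jmap -dump:live.
>             diag.dumpHeap("/tmp/cassandra.hprof", true);
>         } finally {
>             connector.close();
>         }
>     }
> }
> {code}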
> Is it normal or expected for CounterColumn to have that number of instances?
> The data model for how we use counters is as follows: between 50 and 20,000
> counter columns per key. We currently have around 3 million keys total, but
> this issue also reproduced when we had only a few thousand keys. Average
> column count is around 1k, and the 90th percentile is around 18k. New columns
> are added regularly, and existing columns are incremented regularly. No column
> or key deletions occur. We probably have 1-5k "hot" keys at any given time,
> spread across the entire ring. The R:W ratio is typically around 50:1. This is
> the only CF we're using counters on at this time. CF details are as follows:
> {code}
> ColumnFamily: CommentTree
> Key Validation Class: org.apache.cassandra.db.marshal.AsciiType
> Default column value validator: org.apache.cassandra.db.marshal.CounterColumnType
> Cells sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType)
> GC grace seconds: 864000
> Compaction min/max thresholds: 4/32
> Read repair chance: 0.01
> DC Local Read repair chance: 0.0
> Populate IO Cache on flush: false
> Replicate on write: true
> Caching: KEYS_ONLY
> Bloom Filter FP chance: default
> Built indexes: []
> Compaction Strategy: org.apache.cassandra.db.compaction.LeveledCompactionStrategy
> Compaction Strategy Options:
> sstable_size_in_mb: 160
> Column Family: CommentTree
> SSTable count: 30
> SSTables in each level: [1, 10, 19, 0, 0, 0, 0, 0, 0]
> Space used (live): 4656930594
> Space used (total): 4677221791
> SSTable Compression Ratio: 0.0
> Number of Keys (estimate): 679680
> Memtable Columns Count: 8289
> Memtable Data Size: 2639908
> Memtable Switch Count: 5769
> Read Count: 185479324
> Read Latency: 1.786 ms.
> Write Count: 5377562
> Write Latency: 0.026 ms.
> Pending Tasks: 0
> Bloom Filter False Positives: 2914204
> Bloom Filter False Ratio: 0.56403
> Bloom Filter Space Used: 523952
> Compacted row minimum size: 30
> Compacted row maximum size: 4866323
> Compacted row mean size: 7742
> Average live cells per slice (last five minutes): 39.0
> Average tombstones per slice (last five minutes): 0.0
> {code}
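> For anyone more used to CQL than the Thrift-era output above, the CF roughly
> corresponds to the table shape sketched below (a sketch only, using the
> DataStax Java driver 2.x; the keyspace, clustering-column names, and counter
> column name are hypothetical, since the real CF is defined via
> Thrift/cassandra-cli):
> {code}
> // Illustrative CQL3 approximation of the CommentTree CF described above.
> // The keyspace ("ks"), clustering columns (c1, c2, c3), and counter column (cnt)
> // are placeholders, not the real schema. CREATE TABLE IF NOT EXISTS needs 2.0+.
> import com.datastax.driver.core.Cluster;
> import com.datastax.driver.core.Session;
>
> public class CommentTreeSketch {
>     public static void main(String[] args) {
>         Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>         Session session = cluster.connect("ks");
>
>         // ascii partition key + CompositeType(LongType,LongType,LongType) comparator
>         // + CounterColumnType default validator, with LCS and 160MB sstables.
>         session.execute("CREATE TABLE IF NOT EXISTS \"CommentTree\" ("
>                 + " id ascii, c1 bigint, c2 bigint, c3 bigint, cnt counter,"
>                 + " PRIMARY KEY (id, c1, c2, c3))"
>                 + " WITH compaction = {'class': 'LeveledCompactionStrategy',"
>                 + " 'sstable_size_in_mb': 160}");
>
>         // Write path: frequent increments of individual cells, no deletes.
>         session.execute("UPDATE \"CommentTree\" SET cnt = cnt + 1"
>                 + " WHERE id = 'thing1' AND c1 = 1 AND c2 = 2 AND c3 = 3");
>
>         // Read path (~50:1 reads to writes): slice an entire wide row at once.
>         session.execute("SELECT c1, c2, c3, cnt FROM \"CommentTree\""
>                 + " WHERE id = 'thing1'");
>
>         cluster.close();
>     }
> }
> {code}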
> Please let me know if I can provide any further information. I can provide
> the hprof if desired; however, it is 3 GB, so I'll need to share it outside
> of JIRA.