[
https://issues.apache.org/jira/browse/CASSANDRA-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksey Yeschenko resolved CASSANDRA-6405.
------------------------------------------
Resolution: Fixed
Fix Version/s: 2.1 beta2
Reproduced In: 1.2.11, 1.1.7, 1.0.12 (was: 1.0.12, 1.1.7, 1.2.11)
CASSANDRA-6506 has been delayed until 3.0, but this issue is now actually
resolved in 2.1 by the combination of new memtable code and various counters++
commits (including, but not limited to, part of CASSANDRA-6506 and
CASSANDRA-6953).
> When making heavy use of counters, neighbor nodes occasionally enter spiral
> of constant memory consumption
> ---------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-6405
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6405
> Project: Cassandra
> Issue Type: Bug
> Environment: RF of 3, 15 nodes.
> Sun Java 7 (also occurred in OpenJDK 6, and Sun Java 6).
> Xmx of 8G.
> No row cache.
> Reporter: Jason Harvey
> Fix For: 2.1 beta2
>
> Attachments: threaddump.txt
>
>
> We're randomly running into an interesting issue on our ring. When making use
> of counters, we'll occasionally have 3 nodes (always neighbors) suddenly start
> filling up memory, CMSing, filling up again, and repeating. This pattern goes
> on for 5-20 minutes. Nearly all requests to the nodes time out during this
> period. Restarting one, two, or all three of the nodes does not resolve the
> spiral; after a restart the three nodes immediately start hogging memory again
> and CMSing constantly.
> When the issue resolves itself, all 3 nodes immediately get better. Sometimes
> it recurs in bursts: the nodes will be thrashing for 20 minutes, fine for 5,
> thrashing for another 20, and so on for a few cycles.
> Cassandra logs nothing unusual during this period, other than the constant
> dropped read requests and the constant CMS runs. I have analyzed the log files
> prior to multiple distinct instances of this issue and have found no preceding
> events associated with it.
> I have verified that our apps are not performing any unusual number or type
> of requests during this time.
> This behaviour occurred on 1.0.12, 1.1.7, and now on 1.2.11.
> The way I've narrowed this down to counters is a bit naive. It started
> happening when we began making use of counter columns, and went away after we
> rolled back our use of counter columns. I've repeated this attempted rollout
> on each version now, and it consistently rears its head every time. I should
> note this incident does _seem_ to happen less often on 1.2.11 than on the
> previous versions.
> This incident has been consistent across multiple different types of
> hardware, as well as major kernel version changes (2.6 all the way to 3.2).
> The OS is operating normally during the event.
> I managed to get an hprof dump when the issue was happening in the wild.
> Something notable in the class instance counts as reported by jhat. Here are
> the top 5 counts for this one node:
> {code}
> 5967846 instances of class org.apache.cassandra.db.CounterColumn
> 1247525 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
> 1247310 instances of class org.apache.cassandra.cache.KeyCacheKey
> 1246648 instances of class com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
> 1237526 instances of class org.apache.cassandra.db.RowIndexEntry
> {code}
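> (Aside: a dump like the attached hprof can be captured from a node while it is
> thrashing either with jmap, or programmatically over Cassandra's JMX port via
> the HotSpot diagnostic MXBean. The sketch below is illustrative only; the
> host, port, and output path are placeholders, not how this particular dump was
> produced.)
> {code}
> // Illustrative only: trigger a live-objects heap dump on a Cassandra node over
> // JMX (Cassandra's default JMX port is 7199). The dump file is written on the
> // target node's filesystem.
> import java.lang.management.ManagementFactory;
> import javax.management.MBeanServerConnection;
> import javax.management.remote.JMXConnector;
> import javax.management.remote.JMXConnectorFactory;
> import javax.management.remote.JMXServiceURL;
> import com.sun.management.HotSpotDiagnosticMXBean;
>
> public class DumpHeap {
>     public static void main(String[] args) throws Exception {
>         JMXServiceURL url = new JMXServiceURL(
>                 "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
>         JMXConnector connector = JMXConnectorFactory.connect(url);
>         try {
>             MBeanServerConnection mbsc = connector.getMBeanServerConnection();
>             HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
>                     mbsc, "com.sun.management:type=HotSpotDiagnostic",
>                     HotSpotDiagnosticMXBean.class);
>             // 'true' = dump only live (reachable) objects, like jmap -dump:live.
>             diag.dumpHeap("/tmp/cassandra.hprof", true);
>         } finally {
>             connector.close();
>         }
>     }
> }
> {code}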
> Is it normal or expected for CounterColumn to have that number of instances?
> The data model for how we use counters is as follows: between 50 and 20,000
> counter columns per key. We currently have around 3 million keys total, but
> this issue also reproduced when we had only a few thousand keys. Average
> column count is around 1k, and the 90th percentile is around 18k. New columns
> are added regularly, and existing columns are incremented regularly. No column
> or key deletions occur. We probably have 1-5k "hot" keys at any given time,
> spread across the entire ring. The R:W ratio is typically around 50:1. This is
> the only CF we're using counters on at this time. CF details are as follows:
> {code}
> ColumnFamily: CommentTree
> Key Validation Class: org.apache.cassandra.db.marshal.AsciiType
> Default column value validator: org.apache.cassandra.db.marshal.CounterColumnType
> Cells sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType)
> GC grace seconds: 864000
> Compaction min/max thresholds: 4/32
> Read repair chance: 0.01
> DC Local Read repair chance: 0.0
> Populate IO Cache on flush: false
> Replicate on write: true
> Caching: KEYS_ONLY
> Bloom Filter FP chance: default
> Built indexes: []
> Compaction Strategy: org.apache.cassandra.db.compaction.LeveledCompactionStrategy
> Compaction Strategy Options:
> sstable_size_in_mb: 160
> Column Family: CommentTree
> SSTable count: 30
> SSTables in each level: [1, 10, 19, 0, 0, 0, 0, 0, 0]
> Space used (live): 4656930594
> Space used (total): 4677221791
> SSTable Compression Ratio: 0.0
> Number of Keys (estimate): 679680
> Memtable Columns Count: 8289
> Memtable Data Size: 2639908
> Memtable Switch Count: 5769
> Read Count: 185479324
> Read Latency: 1.786 ms.
> Write Count: 5377562
> Write Latency: 0.026 ms.
> Pending Tasks: 0
> Bloom Filter False Positives: 2914204
> Bloom Filter False Ratio: 0.56403
> Bloom Filter Space Used: 523952
> Compacted row minimum size: 30
> Compacted row maximum size: 4866323
> Compacted row mean size: 7742
> Average live cells per slice (last five minutes): 39.0
> Average tombstones per slice (last five minutes): 0.0
> {code}
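> For anyone more used to CQL than the Thrift-era output above, the CF roughly
> corresponds to the table shape sketched below (a sketch only, using the
> DataStax Java driver 2.x; the keyspace, clustering-column names, and counter
> column name are hypothetical, since the real CF is defined via
> Thrift/cassandra-cli):
> {code}
> // Illustrative CQL3 approximation of the CommentTree CF described above.
> // The keyspace ("ks"), clustering columns (c1, c2, c3), and counter column (cnt)
> // are placeholders, not the real schema. CREATE TABLE IF NOT EXISTS needs 2.0+.
> import com.datastax.driver.core.Cluster;
> import com.datastax.driver.core.Session;
>
> public class CommentTreeSketch {
>     public static void main(String[] args) {
>         Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>         Session session = cluster.connect("ks");
>
>         // ascii partition key + CompositeType(LongType,LongType,LongType) comparator
>         // + CounterColumnType default validator, with LCS and 160MB sstables.
>         session.execute("CREATE TABLE IF NOT EXISTS \"CommentTree\" ("
>                 + " id ascii, c1 bigint, c2 bigint, c3 bigint, cnt counter,"
>                 + " PRIMARY KEY (id, c1, c2, c3))"
>                 + " WITH compaction = {'class': 'LeveledCompactionStrategy',"
>                 + " 'sstable_size_in_mb': 160}");
>
>         // Write path: frequent increments of individual cells, no deletes.
>         session.execute("UPDATE \"CommentTree\" SET cnt = cnt + 1"
>                 + " WHERE id = 'thing1' AND c1 = 1 AND c2 = 2 AND c3 = 3");
>
>         // Read path (~50:1 reads to writes): slice an entire wide row at once.
>         session.execute("SELECT c1, c2, c3, cnt FROM \"CommentTree\""
>                 + " WHERE id = 'thing1'");
>
>         cluster.close();
>     }
> }
> {code}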
> Please let me know if I can provide any further information. I can provide
> the hprof if desired; however, it is 3 GB, so I'll need to share it outside
> of JIRA.