Hi Graham,

Unfortunately the problem is more difficult than you might think. References to the buffers can persist in flight to clients long after the memtable is discarded, so recycling slabs would introduce a subtle corruption risk for data returned to clients. The current implementation in 2.1 can't solve this for on-heap buffers without introducing a performance penalty (copying data out of the buffers on read, as we currently do for off-heap data), so I don't expect this change to be introduced until zero-copy off-heap memtables land, and those have been shelved for the moment.
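The copy-on-read penalty Benedict refers to amounts to something like the following sketch. This is not Cassandra's actual code path, just an illustration of why a value read from a recyclable slab must be copied to a private on-heap buffer rather than returned as a view:

```java
import java.nio.ByteBuffer;

// Illustrative sketch: a slab's buffer may be recycled once its memtable is
// discarded, so any value handed to a client must be a private copy, not a
// view into the slab. Class and method names here are invented for clarity.
public class CopyOnRead {
    static ByteBuffer readValue(ByteBuffer slab, int offset, int length) {
        ByteBuffer copy = ByteBuffer.allocate(length); // on-heap, GC-managed
        ByteBuffer view = slab.duplicate();            // independent position/limit
        view.position(offset).limit(offset + length);
        copy.put(view);
        copy.flip();
        return copy; // safe even if the slab is recycled afterwards
    }

    public static void main(String[] args) {
        ByteBuffer slab = ByteBuffer.allocateDirect(1 << 20); // a 1M slab
        slab.put(5, (byte) 42);
        ByteBuffer value = readValue(slab, 0, 16);
        System.out.println(value.get(5)); // 42, read from the private copy
    }
}
```

A zero-copy scheme would return `slab.duplicate()` directly, which is exactly what becomes unsafe once slabs can be reused while reads are still in flight.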
On 15 Jun 2014 10:53, "graham sanderson" <gra...@vast.com> wrote:

> Hi Benedict,
>
> So I had a look at the code, and as you say it looked pretty easy to recycle on-heap slabs… there is already RACE_ALLOCATED, which keeps a strongly referenced pool, however I was thinking in this case of just WeakReferences.
>
> In terms of on-heap slabs, it seemed to me that recycling the oldest slab you have is probably the best heuristic, since it is less likely to be in eden (of course, re-using one in eden is no worse than the worst case today). However, since the problem tends to be promotion failure of slabs due to fragmentation of the old gen, recycling one that is already there is even better - better still if it has been compacted somewhere pretty stable. I think this heuristic would also work well for G1, though I believe the recommendation is still not to use that with Cassandra.
>
> For the implementation I was thinking of using a ConcurrentSkipListMap, from a Long representing the allocation order of the Region to a weak reference to the Region (just regular 1M-sized ones)… allocators can pull the oldest and discard cleared references (we might need a scrubber if the map got too big and we were only checking the first entry). Beyond that I don’t think there is any need for a configurable-length collection of strongly referenced reusable slabs.
>
> Question 1:
>
> This is easy enough to implement, and probably should just be turned on by an orthogonal setting… I guess on-heap slab is the current default, so this feature will be useful.
>
> Question 2:
>
> Something similar could be done for off-heap slabs… this seems more like it would want a size limit on the number of re-usable slabs… strong references with an explicit clean() are probably better than using weak references and letting the PhantomReference cleaner on DirectByteBuffer do the cleaning later.
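The recycler Graham sketches above (a ConcurrentSkipListMap keyed by allocation order, holding WeakReferences, with allocators pulling the oldest surviving slab and scrubbing cleared entries) could look roughly like this. `SlabRecycler` and its methods are illustrative stand-ins, not Cassandra's actual SlabAllocator/Region API:

```java
import java.lang.ref.WeakReference;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the proposed on-heap slab recycler: slabs are tracked in
// allocation order behind WeakReferences, and allocators prefer the oldest
// surviving slab, which is the one most likely already in the old gen.
public class SlabRecycler {
    static final int REGION_SIZE = 1 << 20; // regular 1M regions

    private final AtomicLong counter = new AtomicLong();
    private final ConcurrentSkipListMap<Long, WeakReference<ByteBuffer>> pool =
            new ConcurrentSkipListMap<>();

    // Called when a memtable is discarded: make its slab available for reuse.
    void recycle(ByteBuffer slab) {
        pool.put(counter.getAndIncrement(), new WeakReference<>(slab));
    }

    // Reuse the oldest surviving slab, discarding cleared references as we go.
    ByteBuffer allocate() {
        Map.Entry<Long, WeakReference<ByteBuffer>> e;
        while ((e = pool.pollFirstEntry()) != null) {
            ByteBuffer slab = e.getValue().get();
            if (slab != null) {   // not yet collected: reuse it
                slab.clear();
                return slab;
            }                      // else the GC took it; keep scrubbing
        }
        return ByteBuffer.allocate(REGION_SIZE); // pool empty: fresh slab
    }

    public static void main(String[] args) {
        SlabRecycler r = new SlabRecycler();
        ByteBuffer first = r.allocate();           // fresh 1M slab
        r.recycle(first);
        System.out.println(r.allocate() == first); // still strongly held, so reused
    }
}
```

The WeakReferences mean the pool never delays collection of slabs nobody wants, which is why no configurable-length strong pool is needed on top; the trade-off is that a slab can vanish between being recycled and being reused, hence the scrub loop.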
> Let me know any thoughts and I’ll open an issue (probably 2 - one for on-heap, one for off)… let me know whether you’d like me to assign the first to you or me (I couldn’t work on it before next week).
>
> Thanks,
>
> Graham.
>
> On May 21, 2014, at 2:20 AM, Benedict Elliott Smith <belliottsm...@datastax.com> wrote:
>
> > Graham,
> >
> > This is largely fixed in 2.1 with the introduction of partially off-heap memtables - the slabs reside off-heap, so do not cause any GC issues.
> >
> > As it happens, the changes would also permit us to recycle on-heap slabs reasonably easily as well, so feel free to file a ticket for that, although it won't be back-ported to 2.0.
> >
> > On 21 May 2014 00:57, graham sanderson <gra...@vast.com> wrote:
> >
> >> So I’ve been tinkering a bit with CMS config because we are still seeing fairly frequent full compacting GCs due to fragmentation/promotion failure.
> >>
> >> As mentioned below, we are usually too fragmented to promote new in-flight memtables.
> >>
> >> This is likely caused by sudden write spikes (which we do have), though actually the problems don’t generally happen at the time of our largest write spikes (though any write spike likely causes spill of both new memtables and many other new objects of unknown size into the tenured gen, so they cause fragmentation if not an immediate GC issue). We have lots of things going on in this multi-tenant cluster (GC pauses are of course extra bad, since they cause a spike in hinted handoff on other nodes which were already busy, etc.)
> >>
> >> Anyway, considering possibilities:
> >>
> >> 0) Try and make our application behavior more steady-state - this is probably possible, but there are lots of other things (e.g. compaction, opscenter, repair, etc.) which are both tunable and generally throttle-able to think about too.
> >> 1) Play with tweaking PLAB configs to see if we can ease fragmentation (I’d be curious what the “crud” is in particular that is getting spilled - presumably it is larger objects, since it affects the binary tree of large objects).
> >>
> >> 2) Given the above, if we can guarantee even > 24 hours without a full GC, I don’t think we’d mind running a regular rolling restart on the servers during off hours (note the GCs usually don’t have a visible impact, but when they hit multiple machines at once they can).
> >>
> >> 3) Zing is seriously an option, if it would save us large amounts of tuning and the constant worry about the “next” thing tweaking the allocation patterns - does anyone have any experience with Zing & Cassandra?
> >>
> >> 4) Given that we expect periodic bursts of writes, memtable_total_space_in_mb is bounded, and we are not actually short of memory (it just gets fragmented), I’m wondering if anyone has played with pinning (up to or initially?) that many 1MB chunks of memory via SlabAllocator and re-using them… Each chunk will get promoted once, and then these 1M chunks won’t be part of the subsequent promotion hassle… it will probably also allow more crud to die in eden under write load, since we aren’t allocating these large chunks in eden at the same time. Anyway, I had a little look at the code, and the life cycle of memtables is not trivial, but I was considering attempting a patch to play with… anyone have any thoughts?
> >>
> >> Basically, in summary: the slab allocator helps by allocating and freeing lots of objects at the same time, however any time slabs are allocated under load, we end up promoting them with whatever other live stuff in eden is still there.
> >> If we only do this once and reuse the slabs, we are likely to minimize our promotion problem later (at least for these large objects).
> >>
> >> On May 16, 2014, at 9:37 PM, graham sanderson <gra...@vast.com> wrote:
> >>
> >>> Excellent - thank you…
> >>>
> >>> On May 16, 2014, at 7:08 AM, Samuel CARRIERE <samuel.carri...@urssaf.fr> wrote:
> >>>
> >>>> Hi,
> >>>> This is arena allocation of memtables. See here for more info:
> >>>> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
> >>>>
> >>>> From: graham sanderson <gra...@vast.com>
> >>>> To: email@example.com
> >>>> Date: 16/05/2014 14:03
> >>>> Subject: Things that are about 1M big
> >>>>
> >>>> So just throwing this out there for those for whom this might ring a bell.
> >>>>
> >>>> I’m debugging some CMS memory fragmentation issues on 2.0.5 - and interestingly enough, most of the objects giving us promotion failures are of size 131074 (dwords) - GC logging obviously doesn’t say what those are, but I’d wager money they are either 1M big byte arrays, or (less likely) 256k-entry object arrays backing large maps.
> >>>>
> >>>> So not strictly critical to solving my problem, but I was wondering if anyone can think of any heap-allocated C* objects which are (with no significant changes to standard cassandra config) allocated in 1M chunks. (It would save me scouring the code, or a 9-gig heap dump, if I need to figure it out!)
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Graham
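For what it's worth, the 131074-dword figure quoted above does line up with a 1M byte array. A dword here is 8 bytes, and the 16 bytes left over match a typical 64-bit HotSpot array header (an assumption about the JVM layout, not something stated in the thread):

```java
// Back-of-the-envelope check that 131074 dwords matches a 1M byte array.
public class DwordCheck {
    public static void main(String[] args) {
        long dwords = 131074;           // object size reported in the GC log
        long bytes = dwords * 8;        // a dword is 8 bytes
        long payload = 1L << 20;        // 1M byte array payload
        long header = bytes - payload;  // whatever is left over
        System.out.println(bytes);      // 1048592
        System.out.println(header);     // 16 - a typical 64-bit array header
    }
}
```

The same 1048592-byte size would fit a 256k-entry `Object[]` with compressed oops (256k x 4-byte references plus header), which is why the GC log alone can't distinguish the two candidates Graham names.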