Hi Graham,

Unfortunately the problem is more difficult than you might think.
References to the buffers can remain in flight to clients long after the
memtable is discarded, so recycling the slabs would introduce a subtle
corruption risk for data returned to clients. The current implementation in
2.1 can't solve this for on-heap buffers without a performance penalty
(copying data out of the buffers on read, as we currently do for off-heap
data), so I don't expect this change to land until zero-copy offheap
memtables arrive, and those have been shelved for the moment.
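
To make the trade-off concrete, here is a minimal sketch of the kind of
defensive copy I mean (illustrative only, not the actual read path; the
names are made up):

    import java.nio.ByteBuffer;

    class CopyOnRead
    {
        // If the slab backing 'cellValue' could be recycled while a response is
        // still being serialized, the only safe option is to copy the bytes out
        // first - which is exactly the read-path cost we avoid today for
        // on-heap data.
        static ByteBuffer defensiveCopy(ByteBuffer cellValue)
        {
            ByteBuffer copy = ByteBuffer.allocate(cellValue.remaining());
            copy.put(cellValue.duplicate()); // duplicate() so position/limit are untouched
            copy.flip();
            return copy; // safe to hand to the client even after the slab is reused
        }
    }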


On 15 Jun 2014 10:53, "graham sanderson" <gra...@vast.com> wrote:

> Hi Benedict,
>
> So I had a look at the code, and as you say, it looked pretty easy to
> recycle on-heap slabs… there is already RACE_ALLOCATED, which keeps a
> strongly referenced pool; in this case, however, I was thinking of just
> WeakReferences.
>
> In terms of on-heap slabs, it seemed to me that recycling the oldest slab
> you have is probably the best heuristic, since it is less likely to be in
> eden (of course, re-using one in eden is no worse than the worst case
> today); however, since the problem tends to be promotion failure of slabs
> due to fragmentation of the old gen, recycling one that is already there is
> even better - better still if it has been compacted somewhere pretty
> stable. I think this heuristic would also work well for G1, though I
> believe the recommendation is still not to use that with Cassandra.
>
> So for the implementation of that I was thinking of using a
> ConcurrentSkipListMap from a Long representing the allocation order of the
> Region to a weak reference to the Region (just regular 1MB-sized ones)…
> allocators can pull the oldest and discard cleared references (we might
> need a scrubber if the map got too big and we were only checking the first
> entry). Beyond that, I don't think there is any need for a
> configurable-length collection of strongly referenced reusable slabs.
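> 
> Roughly the shape I have in mind - just a sketch, untested, with
> placeholder names (Region stands in for the SlabAllocator region class):
> 
>     import java.lang.ref.WeakReference;
>     import java.util.Map;
>     import java.util.concurrent.ConcurrentSkipListMap;
>     import java.util.concurrent.atomic.AtomicLong;
> 
>     class RegionRecycler
>     {
>         static class Region {} // stand-in for the real SlabAllocator region type
> 
>         // insertion counter standing in for the region's allocation order
>         private final AtomicLong counter = new AtomicLong();
>         // allocation order -> weakly referenced 1MB region
>         private final ConcurrentSkipListMap<Long, WeakReference<Region>> recyclable =
>                 new ConcurrentSkipListMap<>();
> 
>         void offer(Region region)
>         {
>             recyclable.put(counter.incrementAndGet(), new WeakReference<>(region));
>         }
> 
>         // Prefer the oldest region: it is the most likely to have been
>         // promoted (and compacted) in the old gen already.
>         Region reuseOldest()
>         {
>             Map.Entry<Long, WeakReference<Region>> e;
>             while ((e = recyclable.pollFirstEntry()) != null)
>             {
>                 Region r = e.getValue().get();
>                 if (r != null)
>                     return r; // cleared references are simply discarded as we go
>             }
>             return null; // nothing recyclable; caller allocates a fresh slab
>         }
>     }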
>
> Question 1:
>
> This is easy enough to implement, and should probably just be enabled by
> an orthogonal setting… I guess the on-heap slab allocator is the current
> default, so this feature would be useful.
>
> Question 2:
>
> Something similar could be done for off-heap slabs… this seems more like
> it would want a size limit on the number of re-usable slabs… strong
> references with an explicit clean() are probably better than using weak
> references and letting the PhantomReference cleaner on DirectByteBuffer do
> the cleaning later.
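> 
> For instance, something like this bounded pool - again just a rough sketch
> with made-up names, leaning on the pre-Java-9 sun.nio.ch.DirectBuffer
> internals for the explicit clean():
> 
>     import java.nio.ByteBuffer;
>     import java.util.concurrent.ArrayBlockingQueue;
>     import java.util.concurrent.BlockingQueue;
> 
>     class OffHeapSlabPool
>     {
>         private static final int SLAB_SIZE = 1 << 20; // 1MB regions
>         private final BlockingQueue<ByteBuffer> pool;
> 
>         OffHeapSlabPool(int maxPooled)
>         {
>             pool = new ArrayBlockingQueue<>(maxPooled); // hard cap on retained slabs
>         }
> 
>         ByteBuffer acquire()
>         {
>             ByteBuffer slab = pool.poll();
>             return slab != null ? slab : ByteBuffer.allocateDirect(SLAB_SIZE);
>         }
> 
>         void release(ByteBuffer slab)
>         {
>             slab.clear();
>             if (!pool.offer(slab))  // over the cap: free the memory eagerly
>                 clean(slab);        // rather than waiting on the reference queue
>         }
> 
>         private static void clean(ByteBuffer slab)
>         {
>             // JVM-internal API (pre-Java 9); would want to be guarded/reflective
>             ((sun.nio.ch.DirectBuffer) slab).cleaner().clean();
>         }
>     }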
>
> Let me know any thoughts and I’ll open an issue (probably two - one for
> on-heap, one for off-heap)… let me know whether you’d like me to assign the
> first to you or to me (I couldn’t work on it before next week).
>
> Thanks,
>
> Graham.
>
> On May 21, 2014, at 2:20 AM, Benedict Elliott Smith <
> belliottsm...@datastax.com> wrote:
>
> > Graham,
> >
> > This is largely fixed in 2.1 with the introduction of partially off-heap
> > memtables - the slabs reside off-heap, so do not cause any GC issues.
> >
> > As it happens, the changes would also permit us to recycle on-heap slabs
> > reasonably easily as well, so feel free to file a ticket for that,
> > although it won't be backported to 2.0.
> >
> >
> > On 21 May 2014 00:57, graham sanderson <gra...@vast.com> wrote:
> >
> >> So I’ve been tinkering a bit with CMS config because we are still seeing
> >> fairly frequent full compacting GCs due to fragmentation/promotion
> >> failure.
> >>
> >> As mentioned below, we are usually too fragmented to promote new
> >> in-flight memtables.
> >>
> >> This is likely caused by sudden write spikes (which we do have), though
> >> the problems don’t generally happen at the time of our largest write
> >> spikes (any write spike likely spills both new memtables and many other
> >> new objects of unknown size into the tenured gen, so they cause
> >> fragmentation even if not an immediate GC issue). We have lots of things
> >> going on in this multi-tenant cluster (GC pauses are of course extra bad,
> >> since they cause a spike in hinted handoff on other nodes which were
> >> already busy, etc…)
> >>
> >> Anyway, considering possibilities:
> >>
> >> 0) Try to make our application behavior more steady-state - this is
> >> probably possible, but there are lots of other things (e.g. compaction,
> >> opscenter, repair etc.) which are both tunable and generally
> >> throttle-able to think about too.
> >> 1) Play with tweaking PLAB configs to see if we can ease fragmentation
> >> (I’d be curious what the “crud” is in particular that is getting spilled
> >> - presumably it is larger objects, since it affects the binary tree of
> >> large objects); the sort of flags I mean are sketched after this list.
> >> 2) Given the above, if we can guarantee even > 24 hours without a full
> >> GC, I don’t think we’d mind running a regular rolling restart of the
> >> servers during off hours (note that usually the GCs don’t have a visible
> >> impact, but when they hit multiple machines at once they can)
> >> 3) Zing is seriously an option, if it would save us large amounts of
> >> tuning and constant worry about the “next” thing tweaking the allocation
> >> patterns - does anyone have any experience with Zing & Cassandra?
> >> 4) Given that we expect periodic bursts of writes, that
> >> memtable_total_space_in_mb is bounded, and that we are not actually
> >> short of memory (it just gets fragmented), I’m wondering if anyone has
> >> played with pinning (up to or initially?) that many 1MB chunks of memory
> >> via SlabAllocator and re-using them… they will get promoted once, and
> >> then these 1MB chunks won’t be part of the subsequent promotion hassle…
> >> it will probably also allow more crud to die in eden under write load,
> >> since we aren’t allocating these large chunks in eden at the same time.
> >> Anyway, I had a little look at the code, and the life cycle of memtables
> >> is not trivial, but I was considering attempting a patch to play with…
> >> anyone have any thoughts?
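> >>
> >> For option 1, these are the sort of knobs I mean - not a recommendation,
> >> just illustrative values for the CMS flags I’d start experimenting with:
> >>
> >>     -XX:+PrintPromotionFailure     # log the word size of each failed promotion
> >>     -XX:PrintFLSStatistics=1       # dump CMS free-list (fragmentation) stats each GC
> >>     -XX:OldPLABSize=16             # commonly suggested CMS PLAB tweak for fragmentation
> >>     -XX:-ResizeOldPLAB             # keep the old-gen PLAB size fixed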
> >>
> >> Basically, in summary: the slab allocator helps by allocating and
> >> freeing lots of objects at the same time; however, any time slabs are
> >> allocated under load, we end up promoting them along with whatever other
> >> live stuff is still in eden. If we only do this once and reuse the slabs,
> >> we are likely to minimize our promotion problem later (at least for
> >> these large objects).
> >>
> >> On May 16, 2014, at 9:37 PM, graham sanderson <gra...@vast.com> wrote:
> >>
> >>> Excellent - thank you…
> >>>
> >>> On May 16, 2014, at 7:08 AM, Samuel CARRIERE <samuel.carri...@urssaf.fr>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>> This is arena allocation of memtables. See here for more info:
> >>>> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> From:    graham sanderson <gra...@vast.com>
> >>>> To:      dev@cassandra.apache.org
> >>>> Date:    16/05/2014 14:03
> >>>> Subject: Things that are about 1M big
> >>>>
> >>>>
> >>>>
> >>>> So just throwing this out there for those for whom this might ring a
> >>>> bell.
> >>>>
> >>>> I’m debugging some CMS memory fragmentation issues on 2.0.5 - and
> >>>> interestingly enough, most of the objects giving us promotion failures
> >>>> are of size 131074 (dwords) - GC logging obviously doesn’t say what
> >>>> those are, but I’d wager money they are either 1MB byte arrays or, less
> >>>> likely, 256k-entry object arrays backing large maps.
> >>>>
> >>>> So not strictly critical to solving my problem, but I was wondering if
> >>>> anyone can think of any heap-allocated C* objects which are (with no
> >>>> significant changes to standard cassandra config) allocated in 1MB
> >>>> chunks. (It would save me scouring the code, or a 9 gig heap dump, if I
> >>>> need to figure it out!)
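> >>>>
> >>>> (Sanity check on that guess, assuming a 64-bit JVM with compressed
> >>>> oops: 131074 dwords = 1048592 bytes, which is exactly a 1MB byte[]
> >>>> plus its 16-byte array header - so the size does fit a slab region.)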
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Graham
> >>>
> >>
> >>
>
>
