[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13916720#comment-13916720 ]

Benedict commented on CASSANDRA-6689:
-------------------------------------

[~xedin] the latest is offheap2c, which I have just uploaded, and is for 
ticket 6694, building on this ticket. I think it's the better one to review, 
personally, but if you want to start with this ticket first, offheap1b is the 
latest, yes.

To help the reviewers understand the decisions taken here, I will briefly 
outline the main areas of code that have changed, and why this approach was 
selected. I will leave the details of _how_ it works to the documentation in 
the codebase, and won't cover whether or why we want off-heap memtables, only 
the reasons for implementing them in this way.

First, though, I'd like to potentially put some words in Marcus' mouth, and 
suggest that when he says it is "complicated/involved" he's mostly talking 
about the complexity in reasoning about thread ordering/behaviour, as we do not 
use locks anywhere, and we endeavour to communicate between threads as little 
as possible, and we use the JMM to its fullest (and in this ticket only, abuse 
ByteBuffer badly, and depend sometimes on x86 ordering guarantees rather than 
the JMM). As such it is definitely a complex patch, but actually almost all of 
the complexity is in OffHeapCleaner (NativeCleaner in 6694) and 
OffHeapAllocator (NativeAllocator), with little extra snippets in the 
concurrency utilities, in Referrer/s, and also in the work introduced in 
CASSANDRA-5549 which is probably new to everyone reviewing. This is not 
complexity in the normal sense of code sprawl (although there are lots of 
places touched by this, mostly these are simple modifications), but in that it 
requires sitting and thinking hard about the model and the object lifecycles as 
documented. This is why I have spent a great deal of time specifying the 
assumptions and safety concerns at any of the danger points. This kind of 
complexity is difficult to justify in a simple paragraph or three, and either 
stands or falls by the code itself and its design decisions; I don't think I 
can summarise it here any better than the code comments, and I have no 
alternative implementation to compare and contrast with.

In this patch (ignoring CASSANDRA-6694 for now) there are three main areas of 
changes:
1) New concurrency utilities
2) OffHeapAllocator et al
3) RefAction changes across the codebase

As to 1, these simply help to make the implementation easier and safer. NBQ 
adds a number of useful behaviours, such as CAS-like modification, safe 
iterator removal, and multiple views of the same queue. These facilities improve 
clarity and obviousness in a number of places. WaitQueue and OpOrder are also 
improved from those introduced in CASSANDRA-5549. OpOrder prevents garbage 
accumulation from old Group objects being kept floating around in Referrer 
instances (and makes it faster); WaitQueue reduces the cost of wake-ups by 
eliminating a race through use of NBQ. All of the concurrency utilities could 
be split off into a separate patch if we wanted.
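To give a flavour of the "multiple views" idea, here is a simplified, self-contained sketch: an append-only lock-free list whose nodes are shared while each view advances independently. This is NOT the patch's NBQ; the names and structure are illustrative only.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: shared append-only list with independent consumer views.
class SharedList<T> {
    private static final class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>();
        Node(T value) { this.value = value; }
    }

    private final Node<T> head = new Node<>(null);       // dummy node
    private final AtomicReference<Node<T>> tail = new AtomicReference<>(head);

    // CAS-based append: safe under arbitrary concurrent appenders
    void append(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
            Node<T> t = tail.get();
            if (t.next.compareAndSet(null, node)) {
                tail.compareAndSet(t, node);             // swing tail forward
                return;
            }
            tail.compareAndSet(t, t.next.get());         // help a stalled appender
        }
    }

    // Each view starts at the head and consumes at its own pace; advancing
    // one view past an element never disturbs any other view.
    View<T> view() { return new View<>(head); }

    static final class View<T> {
        private Node<T> pos;
        private View(Node<T> pos) { this.pos = pos; }
        T poll() {
            Node<T> next = pos.next.get();
            if (next == null) return null;               // nothing more (yet)
            pos = next;
            return next.value;
        }
    }
}
```

The real NBQ does more (e.g. safe removal mid-queue), but the shared-nodes/private-cursor split is the essence of why multiple views are cheap.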

As to 2 and 3, as a quick starting point which I will refer back to, I will 
paste a quick bit about the memory lifecycle from the code:

{code}
 * 1) Initially memory is managed like malloc/free: it is considered referenced 
at least until the allocation is
 *    matched by a corresponding free()
 * 2) The memory is then protected by the associated OpOrder(s); an allocation 
is only available for collection
 *    once all operations started prior to the free() have completed. These are 
effectively short lived transactions that
 *    should always complete in a timely fashion (never block or wait 
indefinitely)
 * 3) During this time any operation protected by the read/write OpOrder may 
optionally 'ref' {@link Referrer} an object,
 *    or objects, that each reference (in the normal java sense) some 
allocations; once the read operation finishes these
 *    will have been registered with one or more GC roots {@link Referrers} - 
one per participating allocator group (CFS).
 * 4) When a GC (or allocator discard) occurs, any extant {@link Referrer} that 
were created during an operation that 
 *    began prior to the collect phase are walked, and any regions that are 
reachable from any of these 'refs' are
 *    switched to a refcount phase. When each ref completes it decrements the 
count of any such regions it reached.
 *    Once this count hits 0, the region is finally eligible for reuse.
{code}
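The lifecycle above can be condensed into a small state machine. This is an illustrative model only: the real transitions are lock-free and mediated by the OpOrder barriers, not synchronized methods, and the phase names are mine, not the patch's.

```java
// Illustrative region lifecycle: LIVE -> FREED (awaiting barrier/GC)
// -> REFCOUNTED (still reachable from extant refs) -> REUSABLE.
class Region {
    enum Phase { LIVE, FREED, REFCOUNTED, REUSABLE }

    private Phase phase = Phase.LIVE;
    private int refs;

    // 1) malloc/free-like: free() does not make the region reusable yet
    synchronized void free() {
        if (phase == Phase.LIVE) phase = Phase.FREED;
    }

    // 4) a GC finds the region still reachable from extant Referrers:
    //    switch to refcounting, one count per ref that reached it
    synchronized void switchToRefCount(int reachableRefs) {
        if (phase != Phase.FREED) return;
        if (reachableRefs == 0) { phase = Phase.REUSABLE; return; }
        refs = reachableRefs;
        phase = Phase.REFCOUNTED;
    }

    // each completing ref decrements; at zero the region may be reused
    synchronized void release() {
        if (phase == Phase.REFCOUNTED && --refs == 0) phase = Phase.REUSABLE;
    }

    synchronized Phase phase() { return phase; }
}
```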

Now, as Jonathan mentioned, in CASSANDRA-5549 we introduced the OpOrder 
concurrency primitive. This provides a mechanism for cleaning up on-heap 
memtables, because we can guarantee that no new writes will touch them; it can 
equally guarantee that no new reads will touch them. The problem is that a 
read's lifetime extends beyond touching the memtable, so as soon as we start 
managing the lifecycle of the memory ourselves, we have to somehow track the 
lifetime of any references that escape during a read.
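The barrier guarantee can be sketched in miniature (this is not the real OpOrder API, just the shape of the generation/barrier idea):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal OpOrder-style barrier sketch: operations join the current
// generation; a barrier rolls the generation over and waits for every
// operation started before it to finish.
class MiniOpOrder {
    static final class Group {
        private final AtomicInteger running = new AtomicInteger();
        private volatile boolean closed;
    }

    private volatile Group current = new Group();

    // Begin an operation; it is guaranteed to be seen by any later barrier.
    Group start() {
        while (true) {
            Group g = current;
            g.running.incrementAndGet();
            if (!g.closed) return g;          // safely joined this generation
            g.running.decrementAndGet();      // raced with a barrier; retry
        }
    }

    void finish(Group g) { g.running.decrementAndGet(); }

    // Returns only once every operation started before the barrier has
    // finished; afterwards no pre-barrier operation can still be running.
    void awaitBarrier() {
        Group old = current;
        current = new Group();
        old.closed = true;
        while (old.running.get() > 0) Thread.yield();
    }
}
```

Once awaitBarrier() returns we know no in-flight write or read started before the barrier can still touch the memtable; the whole problem below is that a read's *results* can outlive its Group.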

Since these references will be preventing progress and tying up resources, we 
need to make sure they're handled safely, so we pass a Referrer (or another 
RefAction) into the read methods, which performs any bookkeeping necessary to 
ensure this safety. We pass it in from the originator/caller so we can protect 
it with a try/finally block as far as possible, although for most there is a 
period when they live only on a queue (waiting to be sent by MessageService or 
Netty), and the resources are closed only once the message has been serialized. 
(NB: one thing to explore is whether Netty can fail internally at any time 
without yielding an error to us on one of the paths we expect, as this could 
potentially lead to resource leaks.)
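The try/finally discipline looks roughly like this; RefAction, CountingReferrer and readRow here are simplified stand-ins, not the patch's actual interfaces.

```java
// Simplified stand-ins showing the shape of the caller-owned bookkeeping.
interface RefAction extends AutoCloseable {
    void ref(Object allocation);   // record a reference that escapes the read
    @Override void close();        // release the bookkeeping when done
}

class CountingReferrer implements RefAction {
    int live;                      // escaping references still being tracked
    boolean closed;
    public void ref(Object allocation) { live++; }
    public void close() { closed = true; live = 0; }
}

class ReadPath {
    // a read registers every allocation that will outlive the operation
    static Object readRow(RefAction refs) {
        Object allocation = new Object();  // stand-in for memtable-backed data
        refs.ref(allocation);
        return allocation;
    }
}
```

The originator creates the Referrer, wraps the read in try/finally so close() runs even on failure paths, and the queued-message case simply defers close() until serialization completes.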

Now, to get as far as here I don't think we really have any alternatives 
available to us. We have to track some state until we are done with the data. 
However at steps 3 and 4 we do have the potential for a different decision: the 
referrer currently tracks the specific objects returned by the read; we could 
instead track the OpOrder.Group(s) we read the message in. The added complexity 
here, however, is not very large (400LOC max), and the benefits to the current 
approach are pretty tremendous: we don't snarl up the whole system because of 
one slow client (or, say, missing a leak in Netty). Anything we might do to 
detect/mitigate the risk would almost certainly be as or more complex, and 
would leave me uneasy about its safety. If it went wrong it could spread to the 
whole cluster, as any one dying node could quickly snarl up another that is 
waiting for it to consume a message queued for it, which snarls up another... 
and before you know it everyone's having a bad time.

Finally, as icing on the cake, we come to GC. We have to do all of the above 
steps anyway for memtable discarding, since we have to reclaim the memory 
regions and so need to know when that is safe. So we piggyback off this, and 
free() records in memtables whenever they are overwritten, so that whenever we 
detect the need to flush we first trigger a global memtable GC, 
which will try to reclaim any regions that have free space in them and 
consolidate them into new compacted regions. If we bring ourselves under the 
threshold for flushing, we don't flush. This addition allows us to have our 
cake and eat it when dealing with overwrite workloads. Namely it gives us 
flexibility similar to the non-slab allocator of old, with even less heap 
fragmentation than the slab allocator, independent of any of the other 
(potential) benefits of off-heap.
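The flush decision then becomes, in caricature (all names and numbers here are invented, and the real accounting is per-region, not a single counter):

```java
// Illustrative only: run the compacting memtable GC first, then flush
// only if we are still over the memory threshold afterwards.
class FlushController {
    private long used;
    private final long threshold;
    private final long reclaimableByGc;

    FlushController(long used, long threshold, long reclaimableByGc) {
        this.used = used;
        this.threshold = threshold;
        this.reclaimableByGc = reclaimableByGc;
    }

    boolean shouldFlushAfterGc() {
        if (used <= threshold) return false;       // no pressure at all
        used -= Math.min(used, reclaimableByGc);   // global memtable GC
        return used > threshold;                   // GC was not enough
    }

    long used() { return used; }
}
```

Under a heavy-overwrite workload the GC keeps reclaiming freed records and we rarely flush; under an append-only workload nothing is reclaimable and we flush exactly as before.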


> Partially Off Heap Memtables
> ----------------------------
>
>                 Key: CASSANDRA-6689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 2.1 beta2
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
