[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13916720#comment-13916720 ]
Benedict commented on CASSANDRA-6689:
-------------------------------------
[~xedin] the latest version is offheap2c, which I have just uploaded, and is for
ticket 6694, building on this ticket. I think it's the better one to review,
personally, but if you want to start with this ticket first offheap1b is the
latest, yes.
To help the reviewers understand the decisions taken here, I will
outline briefly the main areas of code that have been changed, and why the
approach was selected. I will leave details about _how_ it works to be
addressed by the documentation in the codebase, and won't cover if or why we
want off-heap memtables, only the reasons for implementing them in this way.
First, though, I'd like to potentially put some words in Marcus' mouth, and
suggest that when he says it is "complicated/involved" he's mostly talking
about the complexity of reasoning about thread ordering/behaviour: we do not
use locks anywhere, we endeavour to communicate between threads as little as
possible, and we use the JMM to its fullest (and, in this ticket only, abuse
ByteBuffer badly and sometimes depend on x86 ordering guarantees rather than
the JMM). As such it is definitely a complex patch, but actually almost all of
the complexity is in OffHeapCleaner (NativeCleaner in 6694) and
OffHeapAllocator (NativeAllocator), with little extra snippets in the
concurrency utilities, in Referrer/s, and also in the work introduced in
CASSANDRA-5549 which is probably new to everyone reviewing. This is not
complexity in the normal sense of code sprawl (although there are lots of
places touched by this, mostly these are simple modifications), but in that it
requires sitting and thinking hard about the model and the object lifecycles as
documented. This is why I have spent a great deal of time specifying the
assumptions and safety concerns at each of the danger points. This kind of
complexity is difficult to justify in a simple paragraph or three, and either
stands or falls by the code itself and its design decisions; I don't think I
can summarise it here any better than the code comments, and I have no
alternative implementation to compare and contrast with.
In this patch (ignoring CASSANDRA-6694 for now) there are three main areas of
changes:
1) New concurrency utilities
2) OffHeapAllocator et al
3) RefAction changes across the codebase
As to 1, these simply help to make the implementation easier and safer. NBQ
adds a number of useful behaviours, such as CAS-like modification, safe
iterator removal, and multiple views on the same queue. These facilities improve
clarity and obviousness in a number of places. WaitQueue and OpOrder are also
improved from those introduced in CASSANDRA-5549. OpOrder now prevents garbage
accumulating from old Group objects being kept floating around in Referrer
instances (and is faster); WaitQueue reduces the cost of wake-ups by
eliminating a race through use of NBQ. All of the concurrency utilities could
be split off into a separate patch if we wanted.
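To give a flavour of the NBQ facilities just mentioned, here is a rough sketch
(hypothetical names, not the actual NBQ code) of CAS-like modification and safe
removal during iteration, done by CAS-ing a node's item rather than unlinking
the node:
{code}
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: a minimal queue conveying the flavour of the facilities
// described above; the real NBQ is richer and more careful than this.
final class SketchQueue<T>
{
    static final class Node<V>
    {
        final AtomicReference<V> item;
        final AtomicReference<Node<V>> next = new AtomicReference<>();
        Node(V item) { this.item = new AtomicReference<>(item); }
    }

    private final AtomicReference<Node<T>> tail = new AtomicReference<>(new Node<T>(null));

    // lock-free append, Michael-Scott style
    void append(T item)
    {
        Node<T> node = new Node<>(item);
        while (true)
        {
            Node<T> t = tail.get();
            if (t.next.compareAndSet(null, node))
            {
                tail.compareAndSet(t, node);         // swing tail; best effort
                return;
            }
            tail.compareAndSet(t, t.next.get());     // help a stalled appender
        }
    }

    // "CAS-like modification": atomically replace an element in place
    static <V> boolean replace(Node<V> node, V expect, V update)
    {
        return node.item.compareAndSet(expect, update);
    }

    // safe removal during iteration: CAS the item to null but keep the links
    // intact, so concurrent iterators never traverse a broken chain
    static <V> boolean remove(Node<V> node, V expect)
    {
        return node.item.compareAndSet(expect, null);
    }
}
{code}
Because removal never unlinks a node, several iterators (views) can walk the
same nodes concurrently without coordination.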
As to 2 and 3, as a starting point that I will refer back to, here is a brief
description of the memory lifecycle, taken from the code:
{code}
* 1) Initially memory is managed like malloc/free: it is considered referenced at least
*    until the allocation is matched by a corresponding free()
* 2) The memory is then protected by the associated OpOrder(s); an allocation is only
*    available for collection once all operations started prior to the free() have
*    completed. These are effectively short-lived transactions that should always
*    complete in a timely fashion (never block or wait indefinitely)
* 3) During this time any operation protected by the read/write OpOrder may optionally
*    'ref' {@link Referrer} an object, or objects, that each reference (in the normal
*    java sense) some allocations; once the read operation finishes these will have been
*    registered with one or more GC roots {@link Referrers} - one per participating
*    allocator group (CFS)
* 4) When a GC (or allocator discard) occurs, any extant {@link Referrer}s that were
*    created during an operation that began prior to the collect phase are walked, and
*    any regions that are reachable from any of these 'refs' are switched to a refcount
*    phase. When each ref completes it decrements the count of any such regions it
*    reached. Once this count hits 0, the region is finally eligible for reuse.
{code}
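As a rough illustration of those four phases, a region moves through a state
machine along these lines (my own hypothetical naming, not the patch's actual
classes):
{code}
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the lifecycle above; the real OffHeapAllocator and
// OffHeapCleaner machinery is considerably more involved.
final class RegionLifecycleSketch
{
    enum State { LIVE, FREED, REFCOUNTED, REUSABLE }

    volatile State state = State.LIVE;      // 1) referenced until free()
    final AtomicInteger refs = new AtomicInteger();

    void free()
    {
        state = State.FREED;                // 2) still guarded by the OpOrder(s)
    }

    // 4) a GC walks the Referrers created by in-flight operations (3); any
    // region one of them reaches switches to refcounting
    void reachedByRef()
    {
        refs.incrementAndGet();
        state = State.REFCOUNTED;
    }

    void refCompleted()
    {
        if (refs.decrementAndGet() == 0)
            state = State.REUSABLE;         // finally eligible for reuse
    }
}
{code}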
Now, as Jonathan mentioned, in CASSANDRA-5549 we introduced the OpOrder
concurrency primitive. This provides a mechanism for cleaning up on-heap
memtables, because it lets us guarantee that no new writes will touch them; it
can equally guarantee that no new reads will touch them. The problem is that a
read's lifetime extends beyond its touching the memtable, so as soon as we
start managing the lifecycle of the memory ourselves, we have to somehow track
the lifetime of any references that escape during a read.
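For anyone who hasn't looked at CASSANDRA-5549, the basic OpOrder pattern looks
roughly like this (a sketch only: memtable.put is a stand-in for the real write
path):
{code}
OpOrder writeOrder = new OpOrder();

// producer side: every write executes inside a Group, marking start/finish
OpOrder.Group op = writeOrder.start();
try
{
    memtable.put(update, op);
}
finally
{
    op.close();
}

// cleanup side: issue a barrier, then wait for every operation that started
// before it; once await() returns, no such operation is still running
OpOrder.Barrier barrier = writeOrder.newBarrier();
barrier.issue();
barrier.await();
// now safe to discard (or, with this patch, free()) the memtable's memory
{code}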
Since these references would otherwise tie up resources and prevent progress,
we need to make sure they're handled safely, so we pass a Referrer (or another
RefAction) into the read methods, which is used to perform any bookkeeping
necessary to ensure this safety. We pass it in from the originator/caller so we
can protect it with a try/finally block as far as possible, although for most
reads there is a period when the refs live only on a queue (waiting to be sent
by either MessagingService or Netty), and we only release the resources once
the message has been serialized. (NB: one thing to explore is whether Netty can
fail internally at any time without yielding an error to us on one of the paths
we expect, as this could potentially lead to resource leaks.)
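In its simplest form, where serialization completes inside the try block, the
flow sketched above looks roughly like this (names and signatures here are
hypothetical, purely for illustration):
{code}
RefAction refAction = RefAction.refer();        // collect refs escaping this read
try
{
    Row row = cfs.getRow(filter, refAction);    // read methods take the RefAction
    reply = makeResponse(row);                  // after this, the refs may only live on a queue
}
finally
{
    refAction.complete();                       // release the refs once we are done
}
{code}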
Now, to get this far I don't think we really have any alternatives available
to us: we have to track some state until we are done with the data. However, at
steps 3 and 4 we do have the potential for a different decision: the referrer
currently tracks the specific objects returned by the read; we could instead
track the OpOrder.Group(s) the read executed in. The added complexity of the
current approach, however, is not very large (400LOC max), and its benefits are
pretty tremendous: we don't snarl up the whole system because of one slow
client (or, say, a missed leak in Netty). Anything we might do to
detect/mitigate that risk would almost certainly be as complex or more so, and
would leave me uneasy about its safety. If it went wrong it could spread to the
whole cluster, as any one dying node could quickly snarl up another that is
waiting for it to consume a message queued for it, which snarls up another...
and before you know it everyone's having a bad time.
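To make the two options concrete (hypothetical code):
{code}
// (a) chosen approach: ref the specific objects the read returned, so a slow
//     consumer pins only the regions those objects actually live in
refAction.ref(row);

// (b) alternative: ref the OpOrder.Group the read executed in - simpler, but
//     it would pin every region freed since that group began until the slow
//     consumer finished, snarling up the allocator for everyone
refAction.ref(readGroup);
{code}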
Finally, as icing on the cake, we come to GC. We have to do all of the above
steps anyway for memtable discarding, since we have to reclaim the memory
regions and so need to know when that is safe to happen. So we piggyback off of
this, and we free() records in memtables whenever they are overwritten, so that
whenever we detect the need to flush we first trigger a global memtable GC,
which will try to reclaim any regions that have free space in them and
consolidate them into new compacted regions. If we bring ourselves under the
threshold for flushing, we don't flush. This addition allows us to have our
cake and eat it when dealing with overwrite workloads. Namely, it gives us
flexibility similar to the non-slab allocator of old, with even less heap
fragmentation than the slab allocator, independent of any of the other
(potential) benefits of off-heap.
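The resulting flush path looks roughly like this (hypothetical method names):
{code}
// Hypothetical sketch of the flush-vs-GC decision described above.
void maybeFlush()
{
    if (memoryOwned() < flushThreshold)
        return;                        // under the threshold; nothing to do
    runGlobalMemtableGC();             // compact regions containing free()d records
    if (memoryOwned() >= flushThreshold)
        flushLargestMemtable();        // GC reclaimed too little; really flush
}
{code}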
> Partially Off Heap Memtables
> ----------------------------
>
> Key: CASSANDRA-6689
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Benedict
> Assignee: Benedict
> Fix For: 2.1 beta2
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)