[
https://issues.apache.org/jira/browse/CASSANDRA-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861886#comment-13861886
]
Benedict commented on CASSANDRA-5549:
-------------------------------------
I have a patch available for this
[here|https://github.com/belliottsmith/cassandra/tree/only-5549].
I've been a little reluctant to post it, as it's a bit of a monster of a patch,
but I think I've now done my best to keep it well commented and mostly limit
unnecessary changes. Some changes may appear over-engineered for their current
use, but I am using them in a continuation of this patch for off-heap
memtables. I'll describe some of these below; unpicking changes that are still
useful seemed wasteful. If they get in the way of review, we can revisit that
decision.
There are several main areas of updates:
1) Removal of switchLock itself: The main work here is actually in the
OpOrdering synchronisation class. This class explains itself, so I won't go
into detail here, but it provides an easy mechanism for coordinating our
updates to Memtables, so that we know up to what CL position they contain
data and when a memtable is safe to be written to disk. The actual
flushing of the memtable has been refactored a little also, to keep ordering
guarantees.
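
To make the idea concrete, here is a minimal sketch of this kind of operation ordering. It is not the OpOrdering class from the patch: the names (OpOrderSketch, Group, start, finish, barrier) are hypothetical, and it assumes a single thread issues barriers and awaits them in order. Writers enter the current group without taking a lock; a flush swaps in a new group and waits only for the operations that started before the swap.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; names and structure do not mirror the patch's OpOrdering API.
final class OpOrderSketch {

    /** All operations started between two barriers share one Group. */
    static final class Group {
        // state >= 0: group is open and `state` operations are running.
        // state <  0: group is expired and (-state - 1) operations are running.
        private final AtomicInteger state = new AtomicInteger(0);
        private final CountDownLatch drained = new CountDownLatch(1);

        private boolean tryEnter() {
            while (true) {
                int s = state.get();
                if (s < 0) return false; // expired: caller retries on the new group
                if (state.compareAndSet(s, s + 1)) return true;
            }
        }

        /** Called by a writer when its operation is complete. */
        void finish() {
            while (true) {
                int s = state.get();
                if (s == 0 || s == -1) throw new IllegalStateException("no running operation");
                int next = s > 0 ? s - 1 : s + 1; // one fewer running op in either encoding
                if (state.compareAndSet(s, next)) {
                    if (next == -1) drained.countDown(); // expired and nothing left running
                    return;
                }
            }
        }

        private void expire() {
            while (true) {
                int s = state.get();
                if (state.compareAndSet(s, -s - 1)) {
                    if (s == 0) drained.countDown(); // nothing was running
                    return;
                }
            }
        }

        boolean isDrained() { return drained.getCount() == 0; }

        /** Blocks until every operation that entered this group has finished. */
        void await() throws InterruptedException { drained.await(); }
    }

    private final AtomicReference<Group> current = new AtomicReference<>(new Group());

    /** Writer side: join the current group; never blocks on a lock. */
    Group start() {
        while (true) {
            Group g = current.get();
            if (g.tryEnter()) return g;
        }
    }

    /** Flush side: later operations go to a new group; the returned group can be awaited. */
    Group barrier() {
        Group old = current.getAndSet(new Group());
        old.expire();
        return old;
    }
}
```

A writer would call start(), apply its mutation, then finish(); a flush would call barrier() and then await() on the returned group before writing the memtable out.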
2) Allocators and Memory Management: by removing the switch lock, we get rid of
our ability to control heap growth by row mutations. To fix this, I've created
the concept of a PoolAllocator, with associated Pool that has fixed memory
limits. Any allocation requires the pool to allot room from its limit to the
allocator (this is dealt with by MemoryTracker and MemoryOwner). This required
a lot of minor modifications all over the place, to make measurement of object
sizes at modification time cheap and accurate. Mostly I've achieved this by
modifying jamm so that it will always give us a useful answer (a new branch is
[here|https://github.com/belliottsmith/jamm/tree/guess]). Wherever we used
ObjectSizes ad hoc in a class (generally incorrectly, it turns out, which is
not surprising as the API isn't obvious) I now *always* call measure() on an
instance of the object and store the result in a static field, and use simpler
methods for any dynamic space use.
Worth noting: I've renamed IMeasureableMemory.memorySize() to excessHeapSize(),
and I've modified (where applicable) its value to only count data we wouldn't
otherwise be storing. This only makes a difference in a few places, but I think
it is an important distinction.
This change also makes any limit on flush queue size irrelevant, so the metric
we use for controlling flushing is instead the ratio of in-use memory to the
memory limit, ignoring any data that is already flushing. Once that ratio is
breached, we trigger a flush of the largest CFS.
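
As a rough illustration of the accounting described in this section, the sketch below models a pool with a fixed limit and the flush trigger. Everything in it is an assumption made for illustration: PoolSketch and its method names are hypothetical, it tracks only byte counts rather than real allocations, and tryAllocate() returns false where a real pool might block or wait for a flush to free room.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a fixed-limit pool; not the Pool/PoolAllocator from the patch.
final class PoolSketch {
    private final long limitBytes;
    private final double flushRatio;                       // e.g. 0.5 = flush above 50% live use
    private final AtomicLong allocated = new AtomicLong(); // bytes handed out to allocators
    private final AtomicLong flushing = new AtomicLong();  // subset owned by flushing memtables

    PoolSketch(long limitBytes, double flushRatio) {
        this.limitBytes = limitBytes;
        this.flushRatio = flushRatio;
    }

    /** Reserve room from the pool's limit; fails rather than exceed it. */
    boolean tryAllocate(long bytes) {
        while (true) {
            long cur = allocated.get();
            if (cur + bytes > limitBytes) return false;
            if (allocated.compareAndSet(cur, cur + bytes)) return true;
        }
    }

    /** A memtable owning `bytes` has started flushing; stop counting it as live. */
    void markFlushing(long bytes) { flushing.addAndGet(bytes); }

    /** The flush finished; return the memory to the pool. */
    void releaseFlushed(long bytes) {
        flushing.addAndGet(-bytes);
        allocated.addAndGet(-bytes);
    }

    /** True once live (non-flushing) use exceeds the configured share of the limit. */
    boolean needsFlush() {
        long live = allocated.get() - flushing.get();
        return live > flushRatio * limitBytes;
    }

    /** Pick the owner with the most live bytes, i.e. the flush candidate. */
    static String largest(Map<String, Long> liveBytesByOwner) {
        String best = null;
        long bestBytes = -1;
        for (Map.Entry<String, Long> e : liveBytesByOwner.entrySet()) {
            if (e.getValue() > bestBytes) {
                best = e.getKey();
                bestBytes = e.getValue();
            }
        }
        return best;
    }
}
```

Because flushing data is subtracted before the comparison, starting a flush immediately lowers the ratio, so one breach does not trigger a cascade of further flushes.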
3) Some concurrency primitives: NonBlockingQueue (and related classes) and
WaitQueue. NonBlockingQueue is used more extensively in the off-heap changes,
but I leave it in here because it improves WaitQueue a lot, and we rely on
WaitQueue much more with the proliferation of the OpOrdering operations. It
helps us move much closer to completely non-blocking read/write operations
also. We also use it to get rid of the Thread.yield() in SlabAllocator. I've
aimed to keep NBQ as simple as possible.
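
For readers unfamiliar with the pattern, a wait queue built on a non-blocking queue can be sketched roughly as follows. This is not the WaitQueue or NonBlockingQueue from the patch: java.util.concurrent.ConcurrentLinkedQueue stands in for NBQ, and the names WaitQueueSketch, Signal, register and signalOne are hypothetical. The intended use is register, re-check the condition, then either cancel or wait, which avoids both missed wake-ups and spinning with Thread.yield().

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch; ConcurrentLinkedQueue stands in for the patch's NonBlockingQueue.
final class WaitQueueSketch {

    /** One registered waiter. "done" means signalled or cancelled. */
    static final class Signal {
        private final Thread waiter = Thread.currentThread();
        private final AtomicBoolean done = new AtomicBoolean();

        boolean isDone() { return done.get(); }

        /** Give up waiting, e.g. because the condition became true after registering. */
        boolean cancel() { return done.compareAndSet(false, true); }

        /** Park until signalled; an interrupt is remembered and restored on exit. */
        void awaitUninterruptibly() {
            boolean interrupted = false;
            while (!done.get()) {
                LockSupport.park(this);
                interrupted |= Thread.interrupted();
            }
            if (interrupted) Thread.currentThread().interrupt();
        }
    }

    private final ConcurrentLinkedQueue<Signal> waiters = new ConcurrentLinkedQueue<>();

    /** Register the calling thread; the caller must re-check its condition afterwards. */
    Signal register() {
        Signal s = new Signal();
        waiters.add(s);
        return s;
    }

    /** Wake the oldest waiter that has not been cancelled; false if there was none. */
    boolean signalOne() {
        Signal s;
        while ((s = waiters.poll()) != null) {
            if (s.done.compareAndSet(false, true)) {
                LockSupport.unpark(s.waiter);
                return true;
            }
        }
        return false;
    }

    /** Wake every registered waiter. */
    void signalAll() {
        while (signalOne()) { /* keep draining */ }
    }
}
```

Neither side takes a lock: registering and signalling are single queue operations plus one compare-and-set on the waiter's flag.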
4) CommitLog has been updated to use OpOrdering, and also includes a bug fix. I
considered splitting this into a separate ticket, but it's such a tiny
proportion of the overall changes I'm not sure it warrants it. The bug fix we
may want to split out if this takes a while to go through.
> Remove Table.switchLock
> -----------------------
>
> Key: CASSANDRA-5549
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5549
> Project: Cassandra
> Issue Type: Bug
> Reporter: Jonathan Ellis
> Assignee: Benedict
> Labels: performance
> Fix For: 2.1
>
> Attachments: 5549-removed-switchlock.png, 5549-sunnyvale.png
>
>
> As discussed in CASSANDRA-5422, Table.switchLock is a bottleneck on the write
> path. ReentrantReadWriteLock is not lightweight, even if there is no
> contention per se between readers and writers of the lock (in Cassandra,
> memtable updates and switches).