[
https://issues.apache.org/jira/browse/CASSANDRA-6271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benedict updated CASSANDRA-6271:
--------------------------------
Attachment: oprate.svg
I've uploaded a patch to
[6271|https://github.com/belliottsmith/cassandra/tree/iss-6271].
This patch replaces AtomicSortedColumns with AtomicBTreeColumns in Memtable.
The BTree is a custom CoW job, that minimises memory utilisation without
sacrificing performance. Modifications are performed in an actual batch (as
opposed to a simulated batch as currently the case), using a builder to
construct a new tree from as many parts of the old tree as possible. The
fan-factor is currently set to 32, which was both experimentally and logically
a good choice (I was expecting 16-32, to ensure we don't have too many cache
lines for search in a given node, nor high merge costs).
Each node, and the BTree itself, are represented as just an Object[], with
utility functions to operate on the root node. This is a little anti-OOP, but I
don't think a full fledged Set object is either called for or helpful here, as
we have only a small number of operations we perform on the trees, and we'll
only use it in fairly controlled circumstances; and this offers us some further
memory savings.
In synthetic benchmarks, I found this BTree to be slower than TreeSet by around
25-50%, but found that SnapTreeMap (when used in the same manner we use it in
the code) was a similar degree slower again than the BTree, which was a good
start. This is despite trying my best to induce worst-case behaviour from the
BTree, and was persistent across all the tests I performed.
The stress benchmarks are pretty promising too. I've attached graphs of this
patch against trunk, and against this patch combined with my patch for 3578.
The result is roughly 50% greater throughput for writes, with both patches. In
fact, it got to the point where a node was flat out just running stress against
the 4x cluster. I think we can push this a little further with a patch for
switch lock removal, but we'll probably only see a small uptick from that. I
also tested the patch for highly contended writes of a small number of rows,
and found no measureable difference in performance.
Note that this patch does not currently implement iterator removal for the
BTree. I'm not sure yet how is best to do it, and am leaning towards a simple
replacement of the Column that's deleted with a special DeletedColumn, that is
filtered out on iteration. This would mean .size() would either have to be
slower or inaccurate, so I need to do some analysis to determine if this is
okay, or if I should bite the bullet and implement a full batch delete
operation on the BTree, but this is a bit of a pig to do whilst ensuring a
balanced tree, and sometimes wasteful. I could also not balance the tree, or
only make a 'best effort' that avoids re-allocating the same node multiple
times even if it would result in imbalance.
Suggestions welcome.
> Replace SnapTree in AtomicSortedColumns
> ---------------------------------------
>
> Key: CASSANDRA-6271
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6271
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Benedict
> Assignee: Benedict
> Labels: performance
> Attachments: oprate.svg
>
>
> On the write path a huge percentage of time is spent in GC (>50% in my tests,
> if accounting for slow down due to parallel marking). SnapTrees are both GC
> unfriendly due to their structure and also very expensive to keep around -
> each column name in AtomicSortedColumns uses > 100 bytes on average
> (excluding the actual ByteBuffer).
> I suggest using a sorted array; changes are supplied at-once, as opposed to
> one at a time, and if < 10% of the keys in the array change (and data equal
> to < 10% of the size of the key array) we simply overlay a new array of
> changes only over the top. Otherwise we rewrite the array. This method should
> ensure much less GC overhead, and also save approximately 80% of the current
> memory overhead.
> TreeMap is similarly difficult object for the GC, and a related task might be
> to remove it where not strictly necessary, even though we don't keep them
> hanging around for long. TreeMapBackedSortedColumns, for instance, seems to
> be used in a lot of places where we could simply sort the columns.
--
This message was sent by Atlassian JIRA
(v6.1#6144)