[ 
https://issues.apache.org/jira/browse/CASSANDRA-6271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-6271:
--------------------------------

    Attachment: oprate.svg

I've uploaded a patch to 
[6271|https://github.com/belliottsmith/cassandra/tree/iss-6271].

This patch replaces AtomicSortedColumns with AtomicBTreeColumns in Memtable. 
The BTree is a custom CoW job, that minimises memory utilisation without 
sacrificing performance. Modifications are performed in an actual batch (as 
opposed to a simulated batch as currently the case), using a builder to 
construct a new tree from as many parts of the old tree as possible. The 
fan-factor is currently set to 32, which was both experimentally and logically 
a good choice (I was expecting 16-32, to ensure we don't have too many cache 
lines for search in a given node, nor high merge costs).

Each node, and the BTree itself, are represented as just an Object[], with 
utility functions to operate on the root node. This is a little anti-OOP, but I 
don't think a full fledged Set object is either called for or helpful here, as 
we have only a small number of operations we perform on the trees, and we'll 
only use it in fairly controlled circumstances; and this offers us some further 
memory savings.

In synthetic benchmarks, I found this BTree to be slower than TreeSet by around 
25-50%, but found that SnapTreeMap (when used in the same manner we use it in 
the code) was a similar degree slower again than the BTree, which was a good 
start. This is despite trying my best to induce worst-case behaviour from the 
BTree, and was persistent across all the tests I performed.

The stress benchmarks are pretty promising too. I've attached graphs of this 
patch against trunk, and against this patch combined with my patch for 3578. 
The result is roughly 50% greater throughput for writes, with both patches. In 
fact, it got to the point where a node was flat out just running stress against 
the 4x cluster. I think we can push this a little further with a patch for 
switch lock removal, but we'll probably only see a small uptick from that. I 
also tested the patch for highly contended writes of a small number of rows, 
and found no measureable difference in performance.

Note that this patch does not currently implement iterator removal for the 
BTree. I'm not sure yet how is best to do it, and am leaning towards a simple 
replacement of the Column that's deleted with a special DeletedColumn, that is 
filtered out on iteration. This would mean .size() would either have to be 
slower or inaccurate, so I need to do some analysis to determine if this is 
okay, or if I should bite the bullet and implement a full batch delete 
operation on the BTree, but this is a bit of a pig to do whilst ensuring a 
balanced tree, and sometimes wasteful. I could also not balance the tree, or 
only make a 'best effort' that avoids re-allocating the same node multiple 
times even if it would result in imbalance. 

Suggestions welcome.

> Replace SnapTree in AtomicSortedColumns
> ---------------------------------------
>
>                 Key: CASSANDRA-6271
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6271
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benedict
>            Assignee: Benedict
>              Labels: performance
>         Attachments: oprate.svg
>
>
> On the write path a huge percentage of time is spent in GC (>50% in my tests, 
> if accounting for slow down due to parallel marking). SnapTrees are both GC 
> unfriendly due to their structure and also very expensive to keep around - 
> each column name in AtomicSortedColumns uses > 100 bytes on average 
> (excluding the actual ByteBuffer).
> I suggest using a sorted array; changes are supplied at-once, as opposed to 
> one at a time, and if < 10% of the keys in the array change (and data equal 
> to < 10% of the size of the key array) we simply overlay a new array of 
> changes only over the top. Otherwise we rewrite the array. This method should 
> ensure much less GC overhead, and also save approximately 80% of the current 
> memory overhead.
> TreeMap is similarly difficult object for the GC, and a related task might be 
> to remove it where not strictly necessary, even though we don't keep them 
> hanging around for long. TreeMapBackedSortedColumns, for instance, seems to 
> be used in a lot of places where we could simply sort the columns.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to