[ 
https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321332#comment-15321332
 ] 

Michael Kjellman commented on CASSANDRA-9754:
---------------------------------------------

Alright, happy to finally be able to write this. :) I'm attaching a v1 diff 
containing Birch!

h4. Why is it named Birch?
B+ Tree -> Trees that start with the letter B -> Birch... get it? haha...

h4. Description
Birch is a B+ tree-inspired structure aimed at improving the performance of 
the SSTable index in Cassandra (especially for large partitions).

The existing implementation scales poorly with the size of the index/row, as 
the entire index must be deserialized onto the heap even to look up a single 
element. This puts significant pressure on the heap, where a single read of a 
large partition will cause, at minimum, a long and painful CMS GC pause or -- 
in the worst case -- an OOM.
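To illustrate the pattern described above, here is a minimal, hypothetical Java sketch of an index that must be fully materialized on the heap before a single lookup can run. Class and field names are illustrative only, not Cassandra's actual classes:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "eager" index pattern: the whole serialized
// index is deserialized into heap objects before a single binary search
// can run, so heap cost is O(index size) even for one lookup.
public class EagerIndexSketch {
    // One deserialized entry: a key plus its position in the data file.
    public record IndexEntry(String key, long position) {}

    // Illustrative on-disk encoding: [keyLen][keyBytes][position] per entry.
    public static ByteBuffer serialize(List<IndexEntry> entries) {
        int size = 0;
        for (IndexEntry e : entries) size += 4 + e.key().getBytes().length + 8;
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (IndexEntry e : entries) {
            byte[] k = e.key().getBytes();
            buf.putInt(k.length);
            buf.put(k);
            buf.putLong(e.position());
        }
        buf.flip();
        return buf;
    }

    // Deserializes EVERY entry, even though the caller wants just one key.
    public static List<IndexEntry> deserializeAll(ByteBuffer serialized) {
        List<IndexEntry> entries = new ArrayList<>();
        while (serialized.hasRemaining()) {
            int keyLen = serialized.getInt();
            byte[] key = new byte[keyLen];
            serialized.get(key);
            long position = serialized.getLong();
            entries.add(new IndexEntry(new String(key), position));
        }
        return entries; // heap cost paid here: O(total index size)
    }

    // Binary search is O(log n), but only after the full deserialization.
    public static long lookup(ByteBuffer serialized, String target) {
        List<IndexEntry> all = deserializeAll(serialized);
        int lo = 0, hi = all.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = all.get(mid).key().compareTo(target);
            if (cmp == 0) return all.get(mid).position();
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1; // not found
    }
}
```

The search itself is O(log n), but the allocation is proportional to the whole index, which is exactly the GC churn this ticket is about.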

The Birch implementation has a predictable, fixed cost for reads at the 
expense of additional on-disk overhead for the tree itself, with the same 
O(log(n)) complexity as the existing implementation. Every row added to the 
SSTable is also added to the primary index. If the size of the row is greater 
than 64KB we build an index; otherwise we just encode the row's position in 
the SSTable. All entries encoded into the index are page aligned and padded 
to the nearest boundary (4096 bytes by default). Every segment can be marked 
as either internally padded/aligned along a boundary or non-padded/aligned 
(up to 2GB). Birch indexes are aligned into 4096-byte nodes (both leaf and 
inner). Keys are encoded inside the node itself unless they exceed half the 
node size; in that case, the first node-size/2 bytes are encoded into the 
node itself along with the offset of the remaining bytes in the overflow 
page. This enables predictable, fixed performance of the tree while 
accommodating variable-length keys/elements.
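The overflow rule above can be sketched as follows. This is a hedged illustration assuming a 4096-byte node: a key longer than half the node size keeps its first node-size/2 bytes inline and spills the rest to an overflow region, recording the overflow offset in the node. Field layout and names are assumptions for illustration, not Birch's actual encoding:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Illustrative sketch of page-aligned node encoding with key overflow.
// Not Birch's real wire format; entry layout here is [keyLen][overflowOffset][inlineBytes].
public class BirchNodeSketch {
    static final int NODE_SIZE = 4096;            // page-aligned node size
    static final int MAX_INLINE = NODE_SIZE / 2;  // inline-key threshold

    // allocate() zero-fills, so the node is implicitly padded to 4096 bytes.
    private final ByteBuffer node = ByteBuffer.allocate(NODE_SIZE);
    private final ByteArrayOutputStream overflow = new ByteArrayOutputStream();

    // Returns true if the key spilled to the overflow page.
    // (A real node would also bounds-check remaining capacity.)
    public boolean addKey(byte[] key) {
        if (key.length <= MAX_INLINE) {
            node.putInt(key.length);
            node.putLong(-1L);             // sentinel: no overflow
            node.put(key);                 // whole key fits inline
            return false;
        }
        node.putInt(key.length);           // full logical key length
        node.putLong(overflow.size());     // offset of the spilled tail
        node.put(key, 0, MAX_INLINE);      // first node/2 bytes stay inline
        overflow.write(key, MAX_INLINE, key.length - MAX_INLINE);
        return true;
    }

    public int overflowSize() {
        return overflow.size();
    }

    // The node is written out at its full, boundary-aligned size.
    public byte[] alignedNode() {
        return node.array();
    }
}
```

Because every entry's inline footprint is bounded by half the node size, a read touches a fixed number of fixed-size pages, which is what makes the read cost predictable even with variable-length keys.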

h4. Notes on v1 of the diff (in no particular order)
 * I broke the changes into two logical parts: the first abstracts out the 
existing index implementation and adds no new logic; the second adds an 
IndexedEntry implementation backed by a Birch tree.
 * The attached v1 patch is written for 2.1. I have already started rebasing 
the patch onto trunk and hope to finish that shortly and post the trunk-based 
patch.
 * There's some high-level Javadoc in BirchWriter and PageAlignedWriter 
covering the on-disk layout of the tree, the serialization and 
deserialization paths, and the higher-level goals of those classes.
 * The next steps are to start getting feedback from reviewers and the 
community. I have also profiled the tree itself, but profiling the tree 
integrated into the stack and optimizing non-performant code paths comes next 
(after the immediate task of rebasing the change onto trunk).
 * There are still a few TODOs I've left regarding backwards compatibility, 
parts of the code I expect might be non-performant, and things I'd like to 
discuss regarding the "correct" implementation/behavior, etc.
 * I have a few unit tests that still don't pass and still need to be 
root-caused. I've taken the approach this entire time that the unit tests 
shouldn't be modified just to make them pass, so there are still a few 
behavioral regressions I've accidentally introduced. The currently failing 
tests are: 
 ** AutoSavingCacheTest
 ** SecondaryIndexTest
 ** BatchlogManagerTest
 ** KeyCacheTest
 ** ScrubTest
 ** IndexSummaryManagerTest
 ** LegacySSTableTest
 ** MultiSliceTest
 * I need to write a unit test to test reading the legacy/existing primary 
index implementation
 * By the nature of the index's role in the database, the unit test coverage 
is actually pretty extensive, as any read or write touches the index in some 
capacity.

I'll be giving a talk at NGCC tomorrow (Thursday the 9th) to go over the high 
level design I ended up with and considerations I had to take into account once 
I actually got deep inside this part of the code.

Looking forward to feedback!

> Make index info heap friendly for large CQL partitions
> ------------------------------------------------------
>
>                 Key: CASSANDRA-9754
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9754
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: Michael Kjellman
>            Priority: Minor
>
>  Looking at a heap dump of a 2.0 cluster, I found that the majority of the 
> objects are IndexInfo and its ByteBuffers. This is especially bad in 
> endpoints with large CQL partitions. If a CQL partition is, say, 6.4GB, it 
> will have 100K IndexInfo objects and 200K ByteBuffers. This will create a 
> lot of churn for GC. Can this be improved by not creating so many objects?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
