[
https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321332#comment-15321332
]
Michael Kjellman commented on CASSANDRA-9754:
---------------------------------------------
Alright, happy to finally be able to write this. :) I'm attaching a v1 diff
containing Birch!
h4. Why is it named Birch?
B+ Tree -> Trees that start with the letter B -> Birch... get it? haha...
h4. Description
Birch is a B+ish/inspired tree aimed at improving the performance of the
SSTable index in Cassandra (especially with large partitions).
The existing implementation scales poorly with the size of the index/row, as
the entire index must be deserialized onto the heap even to look up a single
element. This puts significant pressure on the heap, where one read to a large
partition will cause at minimum a long, painful CMS GC pause or -- in the
worst case -- an OOM.
The Birch implementation has a predictable fixed cost for reads, at the expense
of additional on-disk overhead for the tree itself, while retaining the same
O(log(n)) complexity as the existing implementation. Every row added to the
SSTable is also added to the primary index. If the size of the row is greater
than 64kb we build an index (otherwise we just encode the position of that row
in the sstable). All entries encoded into the index are page aligned and padded
to the nearest boundary (4096 bytes by default). Every segment can be marked as
either internally padded/aligned along a boundary or non-padded/aligned (up to
2GB). Birch indexes are aligned into 4096 byte nodes (both leaf and inner).
Keys are encoded inside the node itself, unless they exceed half the size of
the node; in that case, the first node-size/2 bytes of the key are encoded into
the node itself, along with the offset of the remaining bytes in the overflow
page. This gives the tree predictable, fixed performance while still
accommodating variable length keys/elements.
h4. Notes on v1 of the diff (in no particular order)
* I broke the changes into two logical parts: The first abstracts out the
existing Index implementation and adds no new logic. The second includes an
IndexedEntry implementation backed by a Birch tree.
* The attached v1 patch is written for 2.1. I have already started rebasing
the patch onto trunk and hope to finish that shortly and post the trunk-based
patch
* There's some high level Javadoc documentation in BirchWriter and
PageAlignedWriter on the layout of the tree on disk, serialization and
deserialization paths, and higher level goals of the classes
* The next steps are to start getting feedback from reviewers and the
community. I have profiled the tree itself in isolation; profiling the tree
integrated into the full stack and optimizing non-performant code paths is next
(after the immediate task of rebasing the change onto trunk)
* There are still a few TODOs I've left regarding backwards compatibility,
parts of the code I expect might be non-performant, and things I'd like to
discuss concerning the "correct" implementation/behavior, etc.
* I have a few unit tests that still don't pass and still need to be root
caused... I've taken the approach this entire time that the unit tests
shouldn't be modified just to pass, so there are still a few behavioral
regressions I've accidentally introduced. The currently failing tests are:
** AutoSavingCacheTest
** SecondaryIndexTest
** BatchlogManagerTest
** KeyCacheTest
** ScrubTest
** IndexSummaryManagerTest
** LegacySSTableTest
** MultiSliceTest
* I need to write a unit test to test reading the legacy/existing primary
index implementation
* By the nature of the index's role in the database, the unit test coverage is
actually pretty extensive, as any read or write touches the index in some
capacity
I'll be giving a talk at NGCC tomorrow (Thursday the 9th) to go over the high
level design I ended up with and considerations I had to take into account once
I actually got deep inside this part of the code.
Looking forward to feedback!
> Make index info heap friendly for large CQL partitions
> ------------------------------------------------------
>
> Key: CASSANDRA-9754
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9754
> Project: Cassandra
> Issue Type: Improvement
> Reporter: sankalp kohli
> Assignee: Michael Kjellman
> Priority: Minor
>
> Looking at a heap dump of a 2.0 cluster, I found that the majority of the
> objects are IndexInfo and its ByteBuffers. This is especially bad in
> endpoints with large CQL partitions. If a CQL partition is, say, 6.4GB, it
> will have 100K IndexInfo objects and 200K ByteBuffers. This will create a lot
> of churn for GC. Can this be improved by not creating so many objects?