Hi guys,

a quick heads up on Mavibot...

Last year, the in-memory MVCC BTree code was completed. Here are the features :
- MVCC Btree implementation, using the Garbage Collection to remove defunct 
pages (associated to defunct versions)
- Transaction support (kindof)
- Backed on disk every N seconds (we flush the latest version into a file)
- Use of a Journal to recover the BTree in case of a crash

All in all, it does the job, but can eat a lot of memory (that's convenient 
though, if there is not too many modifications)

The performances are pretty decent : we can add around 800 000 entries per 
second on my laptop (test done on 1000 trees each one having 100 000 elements 
added) and more than 15 millions searches per second (on a BTree containing 500 
000 elments).

What remains to be done for the BTree:
--------------------------------------
- support of multi-values in the BTree. Currently, we just allow one <Key, 
Value> pair to be stored. We would like to store <Ky, {V1, V2, ... Vn}> tuples. 
This requires some change : we need to add a flag to tell if the bTree can 
support multi-value (that's the easy part :) and otherwise, either store the 
values into a sorted array, or a sub-btree containing <Value, Value> tuples. We 
should probably use both storage : an array up to a number of elements, then a 
BTree. 

Why should we use a BTree, and specifically a MVCC Btree as a sub-btree ? 
Because that would spare us the cost of copyingall the sub-btree when we change 
one single value for an existng key.


The storage :
-------------

Currently, the BTree us being flushed on disk regularly, and we use a Journal 
to be sure that we can recreate a working version of the btree between of two 
flushes, in case of a crash.

That's good enough, assuming that all your values can be stored in memory. But 
when you have millions of entries, that does not fly.

We are working on a different system, which make the btree elements to be 
stroed as SoftReferences : they can be recalimed by the GC at any moment. If we 
need to retrieve them at some point, then we will fetch them from the disk. So 
far, so good, but it requires to write a full storage system. This is the 
on-going work.

We have created a RecordManager in charge of writing the data on disk, and to 
retrieve them on demand. The RM is using physical fixed size pages which 
conatin serialized versions of the data structures we used to store the BTree :
- Pages (either nodes or leaves)
- Values (if they are not directly stored in leaves).

This is a work in progress.

The RM features are :
- capable of storing more than one BTree
- writes are serialized, read are always concurrent
- should be fast
- releases pages are managed by the RM so that we don't have a growing file
- transaction support across many btrees (if we have to update more than one 
BTree for one single operation, then the update should be considered as fully 
applied or discarded on all the impacted BTrees)
- Bulk load support


Where we are :
--------------

The RM work has been started 3 weeks ago, and we have now a piece of code that 
can bootstrap (ie, we can write the initial data structure, and inject a first 
empty BTree). Nothing fancy, but the data structure is ok. 

The next step will be to support basic operations on a BTree and get the result 
being stored on disk. That includes addition, replacement, deletion, and of 
course searches. Those operations should work even if all the data are not in 
memory.

Once done, we will have to focus on the transaction support. At the same time, 
I think we need to implement the multi-value support.

Last, not least, we need to manage the free pages. this is not that complex : 
when a version is created, we copy a set of pages. Those copied pages are 
associated with the version they come from. For instance, let's say we have a 
version V and in order to create a V+1, we copy the pages {p1, p2, ..., pN}. 
Those copied pages will not be used anymore once the version V will be 
out-dated. They can then be reclaimed as soon as the last user of thei version 
leave. We use a BTree to manage the <V, {p1, p2, ..., pN}> tuples (and this is 
one reason why we need to support multi-values). When V is release, we move all 
the associated pages into a list of free pages (basically, we link all the 
pages together and we attach these pages to the existing link of free pages). 
The RM can now use one of these fre pages when it needs some new pages, instead 
of creating new pages at the end of the file.

Raodmap :
---------

There is probably 2 to 3 months of work to get all this being done. We expect 
to use this system in ApacheDS as a replacement for JDBM. 

Obviously, this is a critical piece of code that requires a lot of attention, 
and a lot of testing. However, there is nothing genious or complex in it. It's 
all based of basic and well knwon algorithm, which requires a lot of attention 
in order to have something working fast.

Once the system will work, I don't think it will evolve a lot in the long term. 
There are room for improvement like in the way the RM works on SSD vs on 
spinning disks, or in the cache mamanegment (right now, we don't intend to use 
a cache, but that could be a good option if we want to favor on BTree over some 
others).

The future :
------------
As we will use this system on Directory sooner or later, it makes no sense to 
keep this project in labs. The question would be to know if it should be 
promoted as a TLP (through an incubation step) or being moved under the 
umbrella of an existing project (like Directory, commons, DB,...). I 
definitively think that the second option would be simpler, but will not get as 
much visibility as the first one. Al in all, it will depends on the number of 
projects that will use it. Currently, we are two working on the project (Kiran 
and me), which is not enough to make it a potential TLP.

At this point, I'd like to get some feedback and inputs :)


Thanks !

-- Regards, Cordialement, Emmanuel Lécharny www.iktek.com

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com 


---------------------------------------------------------------------
To unsubscribe, e-mail: labs-unsubscr...@labs.apache.org
For additional commands, e-mail: labs-h...@labs.apache.org

Reply via email to