Hi guys, a quick heads up on Mavibot...
Last year, the in-memory MVCC BTree code was completed. Here are the features : - MVCC Btree implementation, using the Garbage Collection to remove defunct pages (associated to defunct versions) - Transaction support (kindof) - Backed on disk every N seconds (we flush the latest version into a file) - Use of a Journal to recover the BTree in case of a crash All in all, it does the job, but can eat a lot of memory (that's convenient though, if there is not too many modifications) The performances are pretty decent : we can add around 800 000 entries per second on my laptop (test done on 1000 trees each one having 100 000 elements added) and more than 15 millions searches per second (on a BTree containing 500 000 elments). What remains to be done for the BTree: -------------------------------------- - support of multi-values in the BTree. Currently, we just allow one <Key, Value> pair to be stored. We would like to store <Ky, {V1, V2, ... Vn}> tuples. This requires some change : we need to add a flag to tell if the bTree can support multi-value (that's the easy part :) and otherwise, either store the values into a sorted array, or a sub-btree containing <Value, Value> tuples. We should probably use both storage : an array up to a number of elements, then a BTree. Why should we use a BTree, and specifically a MVCC Btree as a sub-btree ? Because that would spare us the cost of copyingall the sub-btree when we change one single value for an existng key. The storage : ------------- Currently, the BTree us being flushed on disk regularly, and we use a Journal to be sure that we can recreate a working version of the btree between of two flushes, in case of a crash. That's good enough, assuming that all your values can be stored in memory. But when you have millions of entries, that does not fly. We are working on a different system, which make the btree elements to be stroed as SoftReferences : they can be recalimed by the GC at any moment. If we need to retrieve them at some point, then we will fetch them from the disk. So far, so good, but it requires to write a full storage system. This is the on-going work. We have created a RecordManager in charge of writing the data on disk, and to retrieve them on demand. The RM is using physical fixed size pages which conatin serialized versions of the data structures we used to store the BTree : - Pages (either nodes or leaves) - Values (if they are not directly stored in leaves). This is a work in progress. The RM features are : - capable of storing more than one BTree - writes are serialized, read are always concurrent - should be fast - releases pages are managed by the RM so that we don't have a growing file - transaction support across many btrees (if we have to update more than one BTree for one single operation, then the update should be considered as fully applied or discarded on all the impacted BTrees) - Bulk load support Where we are : -------------- The RM work has been started 3 weeks ago, and we have now a piece of code that can bootstrap (ie, we can write the initial data structure, and inject a first empty BTree). Nothing fancy, but the data structure is ok. The next step will be to support basic operations on a BTree and get the result being stored on disk. That includes addition, replacement, deletion, and of course searches. Those operations should work even if all the data are not in memory. Once done, we will have to focus on the transaction support. At the same time, I think we need to implement the multi-value support. Last, not least, we need to manage the free pages. this is not that complex : when a version is created, we copy a set of pages. Those copied pages are associated with the version they come from. For instance, let's say we have a version V and in order to create a V+1, we copy the pages {p1, p2, ..., pN}. Those copied pages will not be used anymore once the version V will be out-dated. They can then be reclaimed as soon as the last user of thei version leave. We use a BTree to manage the <V, {p1, p2, ..., pN}> tuples (and this is one reason why we need to support multi-values). When V is release, we move all the associated pages into a list of free pages (basically, we link all the pages together and we attach these pages to the existing link of free pages). The RM can now use one of these fre pages when it needs some new pages, instead of creating new pages at the end of the file. Raodmap : --------- There is probably 2 to 3 months of work to get all this being done. We expect to use this system in ApacheDS as a replacement for JDBM. Obviously, this is a critical piece of code that requires a lot of attention, and a lot of testing. However, there is nothing genious or complex in it. It's all based of basic and well knwon algorithm, which requires a lot of attention in order to have something working fast. Once the system will work, I don't think it will evolve a lot in the long term. There are room for improvement like in the way the RM works on SSD vs on spinning disks, or in the cache mamanegment (right now, we don't intend to use a cache, but that could be a good option if we want to favor on BTree over some others). The future : ------------ As we will use this system on Directory sooner or later, it makes no sense to keep this project in labs. The question would be to know if it should be promoted as a TLP (through an incubation step) or being moved under the umbrella of an existing project (like Directory, commons, DB,...). I definitively think that the second option would be simpler, but will not get as much visibility as the first one. Al in all, it will depends on the number of projects that will use it. Currently, we are two working on the project (Kiran and me), which is not enough to make it a potential TLP. At this point, I'd like to get some feedback and inputs :) Thanks ! -- Regards, Cordialement, Emmanuel Lécharny www.iktek.com -- Regards, Cordialement, Emmanuel Lécharny www.iktek.com --------------------------------------------------------------------- To unsubscribe, e-mail: labs-unsubscr...@labs.apache.org For additional commands, e-mail: labs-h...@labs.apache.org