Hi guys,

a quick update on the bulkloader development... Using what Kiran did in
the mavibot-partition-bulkloader, I moved the part that specifically
handles the BTree bulkloading into Mavibot, because there is no reason
someone could not bulkload any kind of data into a BTree.

I created a BulkLoader class in Mavibot for that purpose. The bulk
loading is a two-phase process :
- sort the data
- create the btree

We had something working for in-memory data (ie, when you can read
everything into memory), but when it comes to reading huge sets of data,
you have to write the sorted data on disk. This is what I have been
working on lately. Basically, we read a chunk of data, and when we reach
a given number of elements (configurable), we sort what we have read and
write it back on disk. I tested it with 10M elements, written across 100
files. The trick is then to merge the sorted data when we read it back
from the files (as each file is sorted, it's just a matter of taking the
smallest value from all the files).
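To make the two steps concrete, here is a minimal, self-contained sketch of that external sort. The names are made up and the sorted chunks live in memory here (the real implementation writes them to temporary files), but the logic is the same: cut the input into chunks of a configurable size, sort each chunk, then merge by repeatedly taking the smallest head across all chunks.

```java
import java.util.*;

public class ExternalSortSketch
{
    /** Phase 1a : cut the input into sorted chunks of at most chunkSize elements. */
    public static List<List<Integer>> sortChunks( List<Integer> input, int chunkSize )
    {
        List<List<Integer>> chunks = new ArrayList<>();

        for ( int start = 0; start < input.size(); start += chunkSize )
        {
            List<Integer> chunk = new ArrayList<>(
                input.subList( start, Math.min( start + chunkSize, input.size() ) ) );
            Collections.sort( chunk );
            chunks.add( chunk ); // the real code would write the chunk to disk here
        }

        return chunks;
    }

    /** Phase 1b : merge the sorted chunks by always taking the smallest head. */
    public static List<Integer> merge( List<List<Integer>> chunks )
    {
        // Each queue entry is {chunk index, position within that chunk},
        // ordered by the element it points at
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            Comparator.comparing( ( int[] h ) -> chunks.get( h[0] ).get( h[1] ) ) );

        for ( int i = 0; i < chunks.size(); i++ )
        {
            if ( !chunks.get( i ).isEmpty() )
            {
                heads.add( new int[]{ i, 0 } );
            }
        }

        List<Integer> merged = new ArrayList<>();

        while ( !heads.isEmpty() )
        {
            int[] h = heads.poll();
            merged.add( chunks.get( h[0] ).get( h[1] ) );

            // Advance this chunk's cursor and re-queue it if not exhausted
            if ( h[1] + 1 < chunks.get( h[0] ).size() )
            {
                heads.add( new int[]{ h[0], h[1] + 1 } );
            }
        }

        return merged;
    }
}
```

The priority queue keeps one candidate per chunk, so each element is compared only against the current heads, not against everything, which is what makes the merge cheap even with 100 files.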

All in all, this part is now working.

The second phase was an issue in the current bulkload implementation :
it left the btree unbalanced (when you fill the leaves completely, the
last leaf may contain fewer than N/2 elements, and the current
implementation didn't handle this case). It was quite a challenge to
balance a btree while building it, as the problem spreads to all the
upper layers. After a few hours thinking about the best way to do it
without having to shuffle many already written pages, I realized that we
actually *know* how many entries we have in the BTree, as we have sorted
the entries first ! Knowing the number of elements is the key : you
can predict how the btree should be balanced.
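A small illustration of that idea, with made-up names and parameters (the actual Mavibot page size and API may differ) : once we know the total number of elements, we can compute how many leaves we need and spread the elements evenly across them, so no leaf ends up under half full.

```java
import java.util.Arrays;

public class LeafDistribution
{
    /**
     * Given the total number of sorted elements and the page capacity,
     * compute how many keys each leaf should hold.
     */
    public static int[] distribute( int nbElems, int pageSize )
    {
        // Minimum number of leaves needed to hold all the elements
        int nbLeaves = ( nbElems + pageSize - 1 ) / pageSize;

        // Spread the elements evenly : every leaf gets the base count,
        // and the first 'extra' leaves get one more element
        int base = nbElems / nbLeaves;
        int extra = nbElems % nbLeaves;

        int[] counts = new int[nbLeaves];
        Arrays.fill( counts, base );

        for ( int i = 0; i < extra; i++ )
        {
            counts[i]++;
        }

        return counts;
    }
}
```

For instance, with 9 elements and pages of 4, filling the leaves completely would give 4, 4, 1 (the last leaf under half full), while the even split gives 3, 3, 3. The same computation can be applied recursively to the upper layers, since the number of leaves tells you how many keys the parent level must hold.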

This is the part I'm working on atm.

There are a few things that still need some love :
- the entrySerializer has to be moved from JDBM to the Mavibot Partition
- the Mavibot partition builder has to be rewritten using the Mavibot
bulkloader (but that should be easy, as we just have to call the
BulkLoader primitive)
- the in-memory bulk load has to be implemented
- I'd like to implement a compact() method that takes an existing BTree
and compacts it on the fly
- we have to deal with the RecordManager, because atm we can't have it
open while processing a btree. It would be interesting to be able to
compact a btree while the RM is up and running (assuming no write
operation is done in the meantime).

Here is where we are atm. More to come in the next few days !
