Hi guys, a quick update on the bulkloader development... Using what Kiran did in the mavibot-partition-bulkloader, I moved the part that specifically handles the BTree bulk loading into Mavibot itself, because there is no reason someone could not bulk load any kind of data into a BTree.
I created a BulkLoader class in Mavibot for that purpose. The bulk loading is a 2-phase process:
- sort the data
- create the btree

We had something working for in-memory data (ie, when you can read everything into memory), but when it comes to reading huge sets of data, you have to write the sorted data to disk. This is what I have been working on lately. Basically, we read a chunk of data, and when we reach a given number of elements (configurable), we sort what we have read and write it back to disk. I tested it with 10M elements, written to 100 files. The deal is then to merge the sorted data when we read it back from the files: as each file is sorted, it's just a matter of taking the smallest value across all the files. All in all, this part is now working.

The second phase was an issue in the current bulkload implementation: it left the btree unbalanced. When you fill the leaves completely, the last leaf may contain fewer than N/2 elements, and the current implementation didn't handle this case. It was quite a challenge to balance a btree while building it, as the problem spreads to all the upper layers. After a few hours thinking about the best way to do it without having to shuffle many already-written pages, I realized that we actually *know* how many entries we have in the BTree, as we have sorted the entries first! Knowing the number of elements is the key: you can predict how the btree should be balanced. This is the part I'm working on atm.

There are a few things that still need some love:
- the entrySerializer has to be moved from JDBM to the Mavibot partition
- the Mavibot partition builder has to be rewritten using the Mavibot bulkloader (but that should be easy, as we just have to call the BulkLoader primitive)
- the in-memory bulk load has to be implemented
- I'd like to implement a compact() method that takes an existing BTree and compacts it on the fly
- we have to deal with the RecordManager, because atm we can't have it open while processing a btree.
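To make the merge step concrete: since each chunk file is sorted, you keep one cursor per file in a priority queue and repeatedly emit the smallest head value. This is just an illustrative sketch over plain text files, not Mavibot's actual tuple serialization; SortedFileMerger and Cursor are hypothetical names:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedFileMerger {

    /** A cursor over one sorted chunk file, holding its current head value. */
    private static final class Cursor {
        final BufferedReader reader;
        String current;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.current = reader.readLine();
        }

        boolean advance() throws IOException {
            current = reader.readLine();
            return current != null;
        }
    }

    /** Merge N sorted chunk files into one sorted output. */
    public static void merge(List<Path> sortedChunks, Path output) throws IOException {
        // The heap always exposes the cursor with the smallest head value
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.current));

        for (Path chunk : sortedChunks) {
            Cursor c = new Cursor(Files.newBufferedReader(chunk));
            if (c.current != null) {
                heap.add(c);
            }
        }

        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();        // chunk with the smallest head value
                out.write(c.current);
                out.newLine();
                if (c.advance()) {
                    heap.add(c);               // re-insert with its next value
                } else {
                    c.reader.close();          // this chunk is exhausted
                }
            }
        }
    }
}
```

Each pop/re-insert costs O(log k) for k chunk files, so merging the 100 files from the 10M-element test stays cheap regardless of the total data size.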
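The "knowing the number of elements is the key" idea can be illustrated like this: given the total entry count and the page size, you can compute up front how many entries each leaf should receive so that no leaf ends up with fewer than N/2 elements. The distribute() helper below is a hypothetical sketch of that computation, not Mavibot's actual code:

```java
import java.util.Arrays;

public class LeafDistribution {

    /**
     * Given the total number of sorted entries and the page size,
     * compute how many entries each leaf should receive so that
     * every leaf holds at least pageSize/2 elements.
     */
    public static int[] distribute(int nbElems, int pageSize) {
        int half = pageSize / 2;  // minimum fill for a non-root page

        int nbLeaves = Math.max(1, (nbElems + pageSize - 1) / pageSize);
        int[] sizes = new int[nbLeaves];
        Arrays.fill(sizes, pageSize);

        // Entries left over for the last leaf once the others are full
        int remainder = nbElems - (nbLeaves - 1) * pageSize;

        if (nbLeaves > 1 && remainder < half) {
            // The tail would underflow: split the last two pages evenly
            // instead of leaving an illegally small last leaf
            int total = pageSize + remainder;
            sizes[nbLeaves - 2] = (total + 1) / 2;
            sizes[nbLeaves - 1] = total / 2;
        } else {
            sizes[nbLeaves - 1] = remainder;
        }

        return sizes;
    }
}
```

For example, with pageSize = 4, 9 elements are laid out as [4, 3, 2] rather than [4, 4, 1], so the last leaf never underflows. The same computation can then be applied recursively to the page counts of the upper layers, which is why writing pages in order, without shuffling anything already written, becomes possible.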
It could be interesting to be able to compact a btree while the RM is up and running (assuming no write operation is done in the meantime). That's where we are atm. More to come in the next days!
