On Wed, Sep 24, 2014 at 2:45 PM, Emmanuel Lécharny <[email protected]> wrote:
> Hi guys,
>
> a quick update on the bulkloader development... Using what Kiran did in
> the mavibot-partition-bulkloader, I moved the part that specifically
> handles the BTree bulkloading to Mavibot, because there is no reason
> someone could not bulkload any kind of data into a BTree.
>
> I created a BulkLoader class in Mavibot for that purpose. The bulk
> loading is a two-phase process:
> - sort the data
> - create the btree
>
> We had something working for in-memory data (ie, when you can read
> everything in memory), but when it comes to reading huge sets of data,
> you have to write the sorted data on disk. This is what I have been
> working on lately. Basically, we read a chunk of data, and when we reach
> a given number of elements (configurable), we sort what we read and write
> it back on disk. I tested it with 10M elements, written to 100 files. The
> deal is then to merge the sorted data when we read it back from the files
> (as each file is sorted, it's just a matter of taking the smallest value
> from all the files).
>
> All in all, this part is now working.
>
> The second phase was an issue in the current bulkload implementation:
> it left the btree unbalanced (when you fill the leaves completely, the
> last leaf may contain fewer than N/2 elements; the current
> implementation didn't handle this case). It was quite a challenge to
> balance a btree while building it, as the problem spreads to all the
> upper layers. After a few hours thinking about the best way to do it
> without having to shuffle many already written pages, I realized that we
> actually *know* how many entries we have in the BTree, as we sorted the
> entries first! Knowing the number of elements is the key: you can
> predict how the btree should be balanced.
>
> This is the part I'm working on atm.
>
> There are a few things that still need some love:
> - the entrySerializer has to be moved from JDBM to the Mavibot Partition

can we move this to a xdbm-partition module?
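The merge phase described above (each chunk file is already sorted, so the merged stream is produced by repeatedly taking the smallest head value across all files) is a classic k-way merge. Here is an illustrative Java sketch using a priority queue over in-memory lists — this is not Mavibot's actual BulkLoader code, which streams from the sorted files:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SortedChunkMerger {
    // Merge several already-sorted chunks into one sorted list by always
    // taking the smallest head element, as in the merge step of an
    // external sort. Each heap entry is {chunk index, position in chunk}.
    static List<Integer> mergeSorted(List<List<Integer>> chunks) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparingInt((int[] e) -> chunks.get(e[0]).get(e[1])));

        for (int i = 0; i < chunks.size(); i++) {
            if (!chunks.get(i).isEmpty()) {
                heap.add(new int[] { i, 0 });
            }
        }

        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            int chunk = head[0], pos = head[1];
            merged.add(chunks.get(chunk).get(pos));
            // Advance within the chunk we just consumed from.
            if (pos + 1 < chunks.get(chunk).size()) {
                heap.add(new int[] { chunk, pos + 1 });
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(mergeSorted(List.of(
            List.of(1, 4, 7), List.of(2, 5), List.of(3, 6))));
        // Prints the fully merged sorted sequence.
    }
}
```

With k chunk files this produces the merged output in O(N log k), touching each file sequentially — which is why the disk-based sort phase only needs enough memory for one element per open file plus the heap.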
> - the Mavibot partition builder has to be rewritten using the Mavibot
> bulkloader (but that should be easy, as we just have to call the
> BulkLoader primitive)
> - the In-memory bulk load has to be implemented
> - I'd like to implement a compact() method that takes an existing BTree
> and compacts it on the fly

hmm, how about applying this at the entire database level rather than to an individual BTree? That would let us write all the data to a new file when the existing file grows with too many free pages or sparse nodes.

> - we have to deal with the RecordManager, because atm we can't have it
> opened while processing a btree. It could be interesting to be able to
> compact a btree while the RM is up and running (assuming no write
> operation is done in the meantime).

right, this is why I think compacting should be applied to the entire database instead of a single BTree. otoh, the bulkloader will be quite handy here while compacting: all we need to do is pass it the relevant cursors of the existing BTrees.

> Here is where we are atm. More to come in the next days!

--
Kiran Ayyagari
http://keydap.com
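The balancing insight from the thread — once the sort phase (or, when compacting, the source BTree) tells you the total number of entries up front, you can plan a layout in which no leaf ends up underfilled — comes down to simple arithmetic: compute the number of leaves first, then spread the entries as evenly as possible. A hypothetical sketch (`planLeaves` and its parameters are illustrative, not Mavibot API):

```java
import java.util.Arrays;

public class BalancedLeafPlanner {
    // Given the total number of sorted entries n (known after the sort
    // phase) and the maximum entries per leaf, return how many entries
    // each leaf should hold. Spreading the remainder evenly keeps every
    // leaf at least half full, unlike naively filling leaves to the brim
    // and leaving a nearly empty last leaf.
    static int[] planLeaves(int n, int maxPerLeaf) {
        int numLeaves = (n + maxPerLeaf - 1) / maxPerLeaf; // ceil(n / maxPerLeaf)
        int base = n / numLeaves;   // minimum entries per leaf
        int extra = n % numLeaves;  // this many leaves carry one extra entry

        int[] plan = new int[numLeaves];
        for (int i = 0; i < numLeaves; i++) {
            plan[i] = base + (i < extra ? 1 : 0);
        }
        return plan;
    }

    public static void main(String[] args) {
        // 10 entries, at most 4 per leaf: naive filling gives 4, 4, 2
        // (last leaf exactly at the minimum only by luck); even spreading
        // gives 4, 3, 3, and the same idea applies recursively to the
        // node layers above.
        System.out.println(Arrays.toString(planLeaves(10, 4)));
    }
}
```

The same computation applied level by level (leaves, then the nodes pointing at them, and so on up to the root) fully determines the shape of the balanced tree before a single page is written, which is what avoids shuffling already-written pages.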
