Hi !
That was not the most efficient week-end of my life :-) Not that I didn't spend time on the code, but I faced some unexpected issues... This is where coding is far less productive when you haven't used a piece of paper beforehand. I would blame my fragmented time...

Here is the problem I faced: when I update a B-tree many times in one single transaction, page splits will occur, and that will impact the B-tree structure. So far, so good, nothing new under the sun. The problem is that I'm working in memory, and when I have to flush the pages to disk, I have to do it in reverse order (ie, the leaves first, then the parent node, etc, up to the root page, ending with the btree header). This is mandatory, because on disk the pages refer to each other by their offset, and if I haven't written a leaf to disk before its parent node, I can't update the offset in the node.

I don't use the offset as a page's unique identifier any more, because it would force me to flush new pages prematurely to disk (well, not exactly: I would just have to ask for a new page with an offset, and that requires an access to the free page list). So in order to avoid such a problem, I now attribute a unique ID to each page. This unique ID is incremental, and I use a Long for it (which allows Mavibot to create up to 9 exa-pages - remember: kilo, mega, giga, tera, peta, exa - or roughly 10^19 pages. If we created 1 billion pages per second, it would take 10 billion seconds to reach the limit, ie around 300 years.)

The consequence is that I have to use a stack of modified pages where I push the leaves, the parent node, the node's parent node, etc, up to the root node, and the btree header. Then I can flush all the pages to disk, and the offsets will be updated properly. But at the same time, I want to be able to retrieve a page in memory quickly, in order to reuse it within a transaction, instead of having to fetch it from disk. So I needed a different data structure: the stack was a List, and searching in a list is linear.
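The two requirements - flushing the modified pages in creation order (leaves before their parents) and quickly finding a page again by its ID within the transaction - can be sketched roughly as follows. These class and method names are made up for illustration, this is not Mavibot's actual code:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical names for illustration -- not Mavibot's real classes.
class Page {
    final long id;       // incremental in-memory ID, assigned at creation
    long offset = -1L;   // real disk offset, unknown until the flush
    Page(long id) { this.id = id; }
}

class WriteTransaction {
    private long nextId = 0L;

    // Keyed by page ID: fast lookup to reuse an in-memory page, and
    // iteration in ascending ID order, ie the order the pages were
    // pushed (leaves first, then parents, up to the btree header).
    private final NavigableMap<Long, Page> modifiedPages = new TreeMap<>();

    Page addModifiedPage() {
        Page page = new Page(nextId++);
        modifiedPages.put(page.id, page);
        return page;
    }

    // Retrieve a modified page from memory instead of re-reading it from disk
    Page getModifiedPage(long id) {
        return modifiedPages.get(id);
    }

    // Flush children before their parents, so a parent can store its
    // children's real disk offsets.
    void flush() {
        long diskOffset = 0L;
        for (Page page : modifiedPages.values()) {
            page.offset = diskOffset;   // stand-in for the real disk write
            diskOffset += 4096;         // fixed 4 KiB page size, for the sketch
        }
    }
}
```

Because the page IDs are incremental, iterating the map in key order is the same as iterating in push order, so one structure covers both needs.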
I'm now using a TreeMap. That solves the issue, *but* I do think it's not ideal. Managing a TreeMap is costly, compared to scanning a list, when it comes to managing a few items. For a big B-tree, we are likely to have only 6 or 7 pages in the stack, so it *might* be faster to use a list rather than a TreeMap... Ok, ok, we are talking about micro-optimisation here :-)

So at the end of the day, I'm facing some issues with offset computations in some cases, because I don't write pages in the correct order in all the possible cases (typically, I fixed an issue where one of the B-trees was split after 7 insertions, then another issue after 8 insertions, then another after 11 insertions, and now I have another incorrect offset after 24 insertions...). It's all about pushing the pages in the correct order, which depends on the page ID, which sometimes is not correctly updated.

It looks like moving forward, then backward, but at the end of the day, it's really about fixing fundamental design issues that became apparent after doing some more thorough tests...

I'll keep you posted next week-end!

--
Emmanuel Lecharny
Symas.com
directory.apache.org
