On 16/02/2017 at 14:17, Shawn McKinney wrote:
>> On Feb 13, 2017, at 3:30 AM, Emmanuel Lécharny <[email protected]> wrote:
>>
>> Regardless, even if we commit after every single update, we are on par with
>> the old version of Mavibot.
>>
>> The cursor rewrite makes the browsing twice as fast (I wrote a 1 million
>> iteration loop browsing 1000 values, which took 24s on the previous Mavibot
>> version and only 14s in the new version).
>>
>> What remains to be done :
>> - deletions aren't supported
>> - Free Page list is not updated
>> - There is no cache, which means we will quickly hit a limit : the memory
>> - The page reclaimer is not implemented yet
>>
>> I'll keep going on my free time this week.
> Speed is (almost) everything. 40% faster is a significant improvement — nice
> work. I have a few questions:
>
> 1. How fast can it go? (what is maximum theoretical velocity for
> inserts/search)
It all depends on the disk speed, and on how long you wait before committing
the transaction. At the end of the day, the limitation is how fast you can
flush pages to disk (that's for writes). For reads, it's a different story,
because at some point you will need some caching (the reason being that you
can't always store all the pages in memory as Java objects, so you need to
read the pages from disk and deserialize them, which is costly).

> 2. When will this be ready to insert into apacheds?

Good question. It depends on how much time I have to work on Mavibot during
nights and week-ends ;-) 2 months ? Maybe 3 ? It's worth noting that the
previous version of Mavibot is already used in ApacheDS, so we already have a
MavibotPartition we can reuse, at a minimal cost.

> 3. How can we help?

Atm, I'm fighting with various aspects of writes. I may commit what I have
now, and give some direction on what's going on, if you want to give a hand.
The trouble I have is that I'm evaluating many possible technical choices,
with some possible dead-ends that slow me down. Discussing those aspects may
speed up the work !

All in all, here are the various parts Mavibot is made of :

o The B-tree update operations (insert/delete). It's all about how to massage
Nodes/Leaves when adding or removing some data from a B-tree. The algorithm
has been implemented and tested, there is not that much to be done in this
area. One part could probably be improved, though : when a page is modified
and it has already been modified (ie copied) in the same transaction, there
is no need to copy it into a new page again. We currently do that, which is a
waste of memory/CPU.

o The Write Transaction handling : this is the hairy part atm. We want to
keep in memory the pages that have already been modified, in order to avoid
fetching them from disk and writing them to disk for every single operation.
My initial idea was to use the disk offset of each page as a marker : when a
Page is read, we know its offset, and if this offset is met again, then we
know we have the page in memory. That works well, except that pages can split
or merge. Also, in order to know the pages' offsets, we have to write them on
disk, and it's not necessarily a good idea to do so when some new pages are
required. The new idea is to use a unique ID per page (a Long) that is
written in each page (I'll put a rough sketch of the idea after this list).
This is what I was working on at the end of last week-end. Another problem is
to be sure we write pages in the correct order : as each BTree Header refers
to a RootPage and a BTreeInfo, those two pages' offsets must be known, which
means we have to write them on disk beforehand. As you can see, it's mainly
getting your hands dirty at this point, but nothing really complicated.

o Once the writes are OK (ie, B-trees get correctly written on disk, and read
back), there are two things that need to be done. The first thing is to
manage dead pages, typically the CopiedPagesBTree pages, which don't have to
remain available forever (ie, past the commit). We just have to move them to
the FreePageList, which has to be written on disk. Second part : we have to
reuse those free pages instead of allocating new ones at the end of the file
(a small sketch after the list, too).

o Reads are quite simple : you fetch a RecordManagerHeader, which is an
immutable copy, and that gives you a revision you can play with. The
TupleCursor is something you get back when you browse a BTree, and it's
already working pretty well. Obviously you can fetch any user B-tree and
browse it. The only part that has to be added is a way to 'remember' which
versions are in use. The best solution is to use a structure that associates
a counter with a revision (also sketched after the list). Once this counter
goes down to 0, the revision is no longer in use, and the 'reclaimer' can
start reclaiming some of the unused pages for this unused version.

o The Reclaimer is the Mavibot GC : it gets back unused pages and moves them
back to the FreePageList. Now, it's a bit complicated, because it may
conflict with the Writer (we have one single writer thread). The question is
whether to have a dedicated reclaimer thread, or to use the writer thread as
a reclaimer. I would say that having 2 threads is probably a better idea,
because they just share the free page list. This is still an open question.

o As we haven't yet worked very hard on implementing a Reclaimer (although
Kiran has written one that works in the previous Mavibot version), we are
currently limited to reclaiming pages for versions that are not used anymore,
and only if there are no older used versions. That means we can't reclaim
anything younger than the oldest used revision. We had a long discussion with
Kiran 3 years ago, while he was in France for the Paris LDAPCon, and we
agreed that there are better algorithms : typically, we can reclaim unused
pages as soon as they have been copied by a younger revision. One good
example is the root page : it can always be reclaimed when the B-tree
revision is not in use anymore, even if we have older revisions in use.
That's not exactly true for other pages. The algorithm is a bit tricky, but
we do have the data structure that keeps track of all the copied pages : they
are in the CopiedPagesBTree (for each revision, and each B-tree, we store the
list of copied pages in this B-tree).

o Cache : we don't have any cache atm in this new version. The previous
version was using an LRUMap, in order to avoid serializing/deserializing
pages over and over, which is CPU intensive. I have tried various other
solutions, like WeakReferences, but it's truly worse (bottom line, do not
expect the GC to do the work, it's way too expensive). We want to cache
Pages, but in a slightly more subtle way : a Page may contain many Values and
Keys, and those values and keys are going to be Java objects which will have
to be deserialized. What we use is a KeyHolder and a ValueHolder, which
contain the raw value that can be deserialized on demand (a last sketch after
the list). Flushing a page is then all about flushing the raw values, instead
of serializing all the Keys/Values when only one has changed. The very same
thing for reads : we usually read only a Key and a Value at a time, we don't
need to deserialize all the other values and keys if the page wasn't in
memory. That saves a hell of a lot of CPU and time.

o Tests : we don't have enough. We most certainly need more. That's also
probably the best starting point.
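To make a couple of those points more concrete, here are a few very rough
sketches. All the class and method names below are made up for the example,
and the real code will certainly look different, but they should give an idea
of what I have in mind.

First, the 'unique ID per page' idea for the write transaction : keep the
pages already copied in the current transaction in a map keyed by that ID, so
a page is copied at most once per transaction and is found back in memory
without going to the disk :

import java.util.HashMap;
import java.util.Map;

class PageSketch
{
    private final long id;    // unique ID, also written in the serialized page
    private byte[] data;      // whatever the page contains

    PageSketch( long id, byte[] data )
    {
        this.id = id;
        this.data = data;
    }

    long getId()
    {
        return id;
    }

    /** Copy-on-write : the copy keeps the same logical ID */
    PageSketch copy()
    {
        return new PageSketch( id, data.clone() );
    }
}

class WriteTransactionSketch
{
    /** Pages already copied in this transaction, keyed by their unique ID */
    private final Map<Long, PageSketch> dirtyPages = new HashMap<>();

    /** Returns the writable copy of a page, copying it only the first time */
    PageSketch getWorkingCopy( PageSketch page )
    {
        return dirtyPages.computeIfAbsent( page.getId(), id -> page.copy() );
    }
}

This would also solve the 'page already modified' waste mentioned in the
B-tree update item : asking twice for the same page within a transaction
simply returns the same in-memory copy.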
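In the same spirit, a tiny sketch of the free page reuse : when a new page is
needed, pop an offset from the free list if there is one, otherwise allocate
at the end of the file (again, invented names, just to show the intent) :

import java.util.ArrayDeque;
import java.util.Deque;

class PageAllocatorSketch
{
    /** Offsets of the pages that have been reclaimed and can be reused */
    private final Deque<Long> freePageOffsets = new ArrayDeque<>();

    /** Offset of the first byte past the end of the file */
    private long endOfFile;

    private final int pageSize;

    PageAllocatorSketch( long endOfFile, int pageSize )
    {
        this.endOfFile = endOfFile;
        this.pageSize = pageSize;
    }

    /** Called when a copied page is not needed by any revision anymore */
    void release( long offset )
    {
        freePageOffsets.push( offset );
    }

    /** Reuse a free page if possible, otherwise grow the file */
    long allocate()
    {
        if ( !freePageOffsets.isEmpty() )
        {
            return freePageOffsets.pop();
        }

        long offset = endOfFile;
        endOfFile += pageSize;

        return offset;
    }
}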
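A rough sketch of the 'counter per revision' structure too, ignoring the
races a real implementation would have to handle (and, once more, the names
are made up) :

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

class RevisionCounterSketch
{
    /** One counter per revision currently in use by at least one reader */
    private final ConcurrentMap<Long, AtomicInteger> counters = new ConcurrentHashMap<>();

    /** A reader starts using the given revision */
    void acquire( long revision )
    {
        counters.computeIfAbsent( revision, rev -> new AtomicInteger( 0 ) ).incrementAndGet();
    }

    /** A reader is done with the revision. Returns true if nobody uses it anymore. */
    boolean release( long revision )
    {
        AtomicInteger counter = counters.get( revision );

        if ( ( counter != null ) && ( counter.decrementAndGet() == 0 ) )
        {
            counters.remove( revision, counter );

            return true;
        }

        return false;
    }

    /** The oldest revision still in use, or -1 : with the current algorithm,
        nothing younger than this can be reclaimed */
    long oldestUsedRevision()
    {
        return counters.keySet().stream().mapToLong( Long::longValue ).min().orElse( -1L );
    }
}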
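And a last sketch, for the lazy deserialization done by the Key/Value
holders : keep the raw bytes read from the page, and only build the Java
object the first time it is asked for, so flushing a page can reuse the raw
bytes of the untouched values (illustrative names only) :

import java.util.function.Function;

class ValueHolderSketch<V>
{
    private final byte[] raw;                          // as read from the page on disk
    private final Function<byte[], V> deserializer;    // how to build the Java object
    private V value;                                   // deserialized lazily, then kept

    ValueHolderSketch( byte[] raw, Function<byte[], V> deserializer )
    {
        this.raw = raw;
        this.deserializer = deserializer;
    }

    /** Deserializes on first access only */
    V getValue()
    {
        if ( value == null )
        {
            value = deserializer.apply( raw );
        }

        return value;
    }

    /** The serialized form, usable as is when the page is flushed back to disk */
    byte[] getRaw()
    {
        return raw;
    }
}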
As you can see, there are a lot of points that need love. Feel free to ask
about whatever you are interested in, I'll try to guide you through the
existing code. I'll try to push a branch tonite.

Thanks Shawn !

--
Emmanuel Lecharny

Symas.com
directory.apache.org
