On 16/02/2017 at 14:17, Shawn McKinney wrote:
>> On Feb 13, 2017, at 3:30 AM, Emmanuel Lécharny <[email protected]> wrote:
>>
>> Regardless, even if we commit after every single update, we are on par with
>> the old version of Mavibot.
>>
>> The cursor rewrite makes the browsing twice as fast (I wrote a 1 million
>> iteration loop browsing 1000 values, which took 24s on the previous Mavibot
>> version and only 14s in the new version).
>>
>> What remains to be done :
>> - deletions aren't supported
>> - Free Page list is not updated
>> - There is no cache, which means we will quickly hit a limit : the memory
>> - The page reclaimer is not implemented yet
>>
>> I'll keep going on my free time this week.
> Speed is (almost) everything. 40% faster is a significant improvement — nice
> work. I have a few questions:
>
> 1. How fast can it go? (what is maximum theoretical velocity for
> inserts/search)
It all depends on the disk speed, and on how long you wait before committing
the transaction. At the end of the day, the limitation is how fast you can
flush pages to disk (that's for writes). For reads, it's a different story,
because at some point you will need some caching (the reason being that you
can't always store all the pages in memory as Java objects, so you need to
read the pages from disk and deserialize them, which is costly).

> 2. When will this be ready to insert into apacheds?

Good question. It depends on how much time I have to work on Mavibot during
nights and week-ends ;-) 2 months ? Maybe 3 ? It's worth noting that the
previous version of Mavibot is already used in ApacheDS, so we already have a
MavibotPartition we can reuse, at a minimal cost.

> 3. How can we help?

Atm, I'm fighting with various aspects of writes. I may commit what I have
now, and give some direction on what's going on, if you want to give a hand.
The trouble I have is that I'm evaluating many possible technical choices,
with some possible dead-ends that slow me down. Discussing those aspects may
speed up the work !

All in all, here are the various parts Mavibot is made of :

o The B-tree update operations (insert/delete). It's all about how to massage
Nodes/Leaves when adding or removing some data from a B-tree. The algorithm
has been implemented and tested, there is not that much to be done in this
area. One part could probably be improved, though : when a page is modified
and it has already been modified (ie copied) in the same transaction, there
is no need to copy it into a new page again. We currently do that, which is a
waste of memory/CPU.

o The Write Transaction handling : this is the hairy part atm. We want to
keep in memory the pages that have already been modified, in order to avoid
fetching them from disk and writing them to disk for every single operation.
My initial idea was to use the disk offset of each page as a marker : when a
Page is read, we know its offset, and if this offset is met again, then we
know we have the page in memory. That works well, except that pages can split
or merge. Also, in order to know the pages' offsets, we have to write them on
disk, and it's not necessarily a good idea to do so when some new pages are
required. The new idea is to use a unique ID per page (a Long) that is
written in each page (I'll put a rough sketch of the idea after this list).
This is what I was working on at the end of last week-end. Another problem is
to be sure we write pages in the correct order : as each BTree Header refers
to a RootPage and a BTreeInfo, those two pages' offsets must be known, which
means we have to write them on disk beforehand. As you can see, it's mainly
getting your hands dirty at this point, but nothing really complicated.

o Once the writes are OK (ie, B-trees get correctly written on disk, and read
back), there are two things that need to be done. The first thing is to
manage dead pages, typically the CopiedPagesBTree pages, which don't have to
remain available forever (ie, past the commit). We just have to move them to
the FreePageList, which has to be written on disk. Second part : we have to
reuse those free pages instead of allocating new ones at the end of the file
(a small sketch after the list, too).

o Reads are quite simple : you fetch a RecordManagerHeader, which is an
immutable copy, and that gives you a revision you can play with. The
TupleCursor is something you get back when you browse a BTree, and it's
already working pretty well. Obviously you can fetch any user B-tree and
browse it. The only part that has to be added is a way to 'remember' which
versions are in use. The best solution is to use a structure that associates
a counter with a revision (also sketched after the list). Once this counter
goes down to 0, the revision is no longer in use, and the 'reclaimer' can
start reclaiming some of the unused pages for this unused version.

o The Reclaimer is the Mavibot GC : it gets back unused pages and moves them
back to the FreePageList. Now, it's a bit complicated, because it may
conflict with the Writer (we have one single writer thread). The question is
whether to have a dedicated reclaimer thread, or to use the writer thread as
a reclaimer. I would say that having 2 threads is probably a better idea,
because they just share the free page list. This is still an open question.

o As we haven't yet worked very hard on implementing a Reclaimer (although
Kiran has written one that works in the previous Mavibot version), we are
currently limited to reclaiming pages for versions that are not used anymore,
and only if there are no older used versions. That means we can't reclaim
anything younger than the oldest used revision. We had a long discussion with
Kiran 3 years ago, while he was in France for the Paris LDAPCon, and we
agreed that there are better algorithms : typically, we can reclaim unused
pages as soon as they have been copied by a younger revision. One good
example is the root page : it can always be reclaimed when the B-tree
revision is not in use anymore, even if we have older revisions in use.
That's not exactly true for other pages. The algorithm is a bit tricky, but
we do have the data structure that keeps track of all the copied pages : they
are in the CopiedPagesBTree (for each revision, and each B-tree, we store the
list of copied pages in this B-tree).

o Cache : we don't have any cache atm in this new version. The previous
version was using an LRUMap, in order to avoid serializing/deserializing
pages over and over, which is CPU intensive. I have tried various other
solutions, like WeakReferences, but it's truly worse (bottom line, do not
expect the GC to do the work, it's way too expensive). We want to cache
Pages, but in a slightly more subtle way : a Page may contain many Values and
Keys, and those values and keys are going to be Java objects which will have
to be deserialized. What we use is a KeyHolder and a ValueHolder, which
contain the raw value that can be deserialized on demand (a last sketch after
the list). Flushing a page is then all about flushing the raw values, instead
of serializing all the Keys/Values when only one has changed. The very same
thing for reads : we usually read only a Key and a Value at a time, we don't
need to deserialize all the other values and keys if the page wasn't in
memory. That saves a hell of a lot of CPU and time.

o Tests : we don't have enough. We most certainly need more. That's also
probably the best starting point.
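To make a couple of those points more concrete, here are a few very rough
sketches. All the class and method names below are made up for the example,
and the real code will certainly look different, but they should give an idea
of what I have in mind.

First, the 'unique ID per page' idea for the write transaction : keep the
pages already copied in the current transaction in a map keyed by that ID, so
a page is copied at most once per transaction and is found back in memory
without going to the disk :

import java.util.HashMap;
import java.util.Map;

class PageSketch
{
    private final long id;    // unique ID, also written in the serialized page
    private byte[] data;      // whatever the page contains

    PageSketch( long id, byte[] data )
    {
        this.id = id;
        this.data = data;
    }

    long getId()
    {
        return id;
    }

    /** Copy-on-write : the copy keeps the same logical ID */
    PageSketch copy()
    {
        return new PageSketch( id, data.clone() );
    }
}

class WriteTransactionSketch
{
    /** Pages already copied in this transaction, keyed by their unique ID */
    private final Map<Long, PageSketch> dirtyPages = new HashMap<>();

    /** Returns the writable copy of a page, copying it only the first time */
    PageSketch getWorkingCopy( PageSketch page )
    {
        return dirtyPages.computeIfAbsent( page.getId(), id -> page.copy() );
    }
}

This would also solve the 'page already modified' waste mentioned in the
B-tree update item : asking twice for the same page within a transaction
simply returns the same in-memory copy.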
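In the same spirit, a tiny sketch of the free page reuse : when a new page is
needed, pop an offset from the free list if there is one, otherwise allocate
at the end of the file (again, invented names, just to show the intent) :

import java.util.ArrayDeque;
import java.util.Deque;

class PageAllocatorSketch
{
    /** Offsets of the pages that have been reclaimed and can be reused */
    private final Deque<Long> freePageOffsets = new ArrayDeque<>();

    /** Offset of the first byte past the end of the file */
    private long endOfFile;

    private final int pageSize;

    PageAllocatorSketch( long endOfFile, int pageSize )
    {
        this.endOfFile = endOfFile;
        this.pageSize = pageSize;
    }

    /** Called when a copied page is not needed by any revision anymore */
    void release( long offset )
    {
        freePageOffsets.push( offset );
    }

    /** Reuse a free page if possible, otherwise grow the file */
    long allocate()
    {
        if ( !freePageOffsets.isEmpty() )
        {
            return freePageOffsets.pop();
        }

        long offset = endOfFile;
        endOfFile += pageSize;

        return offset;
    }
}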
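A rough sketch of the 'counter per revision' structure too, ignoring the
races a real implementation would have to handle (and, once more, the names
are made up) :

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

class RevisionCounterSketch
{
    /** One counter per revision currently in use by at least one reader */
    private final ConcurrentMap<Long, AtomicInteger> counters = new ConcurrentHashMap<>();

    /** A reader starts using the given revision */
    void acquire( long revision )
    {
        counters.computeIfAbsent( revision, rev -> new AtomicInteger( 0 ) ).incrementAndGet();
    }

    /** A reader is done with the revision. Returns true if nobody uses it anymore. */
    boolean release( long revision )
    {
        AtomicInteger counter = counters.get( revision );

        if ( ( counter != null ) && ( counter.decrementAndGet() == 0 ) )
        {
            counters.remove( revision, counter );

            return true;
        }

        return false;
    }

    /** The oldest revision still in use, or -1 : with the current algorithm,
        nothing younger than this can be reclaimed */
    long oldestUsedRevision()
    {
        return counters.keySet().stream().mapToLong( Long::longValue ).min().orElse( -1L );
    }
}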
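And a last sketch, for the lazy deserialization done by the Key/Value
holders : keep the raw bytes read from the page, and only build the Java
object the first time it is asked for, so flushing a page can reuse the raw
bytes of the untouched values (illustrative names only) :

import java.util.function.Function;

class ValueHolderSketch<V>
{
    private final byte[] raw;                          // as read from the page on disk
    private final Function<byte[], V> deserializer;    // how to build the Java object
    private V value;                                   // deserialized lazily, then kept

    ValueHolderSketch( byte[] raw, Function<byte[], V> deserializer )
    {
        this.raw = raw;
        this.deserializer = deserializer;
    }

    /** Deserializes on first access only */
    V getValue()
    {
        if ( value == null )
        {
            value = deserializer.apply( raw );
        }

        return value;
    }

    /** The serialized form, usable as is when the page is flushed back to disk */
    byte[] getRaw()
    {
        return raw;
    }
}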
As you can see, there are a lot of points that need love. Feel free to ask
about whatever you are interested in, I'll try to guide you through the
existing code. I'll try to push a branch tonite.

Thanks Shawn !

--
Emmanuel Lecharny

Symas.com
directory.apache.org
