Emmanuel Lécharny wrote:
Hi guys,
while 1.0.0-M3 is being voted, I started to work on the next iteration.
Kiran is also working on the bulk load tool at the same time, so I will
work on the transaction support.
Updating the bTree is a fairly costly operation, as we have to copy many
pages, reclaim some free pages to store new data, and update the header
many times. We could save many of those writes if we do it at the end of
the update. One of the biggest save would be to write the header only
once (currently, it's being updated every time we ask for a free page).
But even if we implement such a mechanism, there are cases where we
would like to save more. Here are two possible use cases :
- we are injecting many entries in a BTree
- we are updating many BTrees but we want a consistant backend even if
there is a crash in the middle of all those updates, across oall the
updated BTrees (this is typically what we have to guarantee when
updating entries in LDAP).
So where does it bring us ? We need to avoid writing to disk as much as
we can up to the point we can do it safely. The idea is to copy the
pages in memory, without flushing them on disk, until we commit (or
rollback) the operation. If we have to copy the same page more than
once, then we will update the copy in memory. In any case, we operate on
a given revision for all the operation, regardless the number of updates.
When the user is done, we have two use cases :
- a commit is called :
we flush on disk the pages we have stored in memory, and when it's done,
we update the header. We are all good.
- a rollback is called :
there is nothing else to do than to discard all the pages we copied in
memory
The only pb is to manage the free pages and the deleted pages. The free
pages will be reclaimed using the existing list of free pages, and
eventually from the end of the file if we don't have enough free pages.
This is quite easy. The deleted pages is more complicated to deal with,
as we may have many pages we will free. But again, those deleted pages
can be put into the free page list. Worst case is if we can't update the
copiedPage BTree : we may lose some pages (but that would not corrupt
the backend).
This is what I foresee, I need to go a bit deeper in the algorithm, but
I don't think I'm too far from something that flies.
LMDB already does these steps. The only thing that's actually critical to
proper operation is to defer the header update to the end of the commit, so
keeping a copy of this page in memory is critical. Then commit or rollback is
simply a matter of writing the header back to disk or not, no other page
manipulations will be visible without the header update. The other thing is to
make sure you're only using free pages that can't be seen by any active
transactions.
ATM, I will probaby work on teh insert() operation only, and on a BTree
only.
Thoughts ?
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/