Hi guys,

we have had long discussions with Howard on LMDB about the release of
pages that are not anymore in use. This is a critical part of a MVCC
based B-tree, as each update will copy many pages, and we want to re-use
the copied pages.

The problem is that those copied pages may be still in use by older
readers. Let's quickly explain how it all works.

At Ti, we are updating the B-tree, and that creates many pages by
copying pages from previous revisions. We will note those new pages
based on the level in the B-tree they were stored in : P[0, i-1] will be
for the root page, on level 0, and this page previous version was k (and
as it's the root page, k is obviously one version earlier than the
current revision). P[k, a] will be for a page at level k, and with a
previous revision 'a' (where a <= i-1).

One more important point is that a copied page may be split in two new
pages (if teh previous page was full), as we may copy the content of two
old pages into one single new page (when the two old pages contain only
N/2 elemnts and we removed one element in one of the two pages). Last,
not least, we may have to remove one level, if the root page is removed.
All in all, what is important here is to *know* which pages have been
copied during the update operation, because this is those pages that
will be removed at some point.

Now, assuming we keep a track of all the copied pages for a revision R,
we can also assume we can remove all those pages when this revision R is
not anymore in use (ie no reader thread is using R *and* no other reader
thread is using a revision below R). This is what LMDB does, AFAICT.

Can we go a bit further, and get rid of copied pages when a reader is
using a revision < R ? I believe so.

Let say that for a revision R, we have copied a page P[k, c], and let
say we have a reader using a revision P < c (obviously, if a reader uses
a revision > c, we can't free P[k, c], because it's fully part of the
B-tree used by reader P). If c > P, that means we have copied a page
which revison was < P at some point, so we can now get rid of this copy,
no other reader will use it.

case 1 : P is newer than the copied page :

                      P           R
                      |           |
                      v           v
o--------[c]---------[P]---------[n]------> time
          ^           |          /|
          |           |         / +--> we create a new revision, which
copy page [c] and create page [n]
          .-----------|--------.         
                      |
                      +--------------> Revision P is using page [c]
                      
Here, we can't get rid of page [c], as it's still part of revision P B-tree

case 2 : P is older than the copied page

          P           Q           R
          |           |           |
          v           v           v
o--------[P]---------[c]---------[n]------> time
          |           ^          /|
          |           |         / +--> we create a new revision, which
copy page [c] and create page [n]
          |           .--------.         
          |
          +--------------> Revision P

Here, an update at revision Q has created a page [c] which has been
copied by the update at revision R. We can get rid of page [c] because
no other reader will ever use it.

That is all clear and easy. Now, the real pb is how do we determinate
the pages we can free based on the copied pages at revision R ? At
revision R, we know which pages we have copied, and each page has a
unique revision number. That allows us to know which pages to free quite
immediately : we just free all the pages which revision is newer than
the most recent reader revision.

For instance, we can immediately get rid of all the root pages for any
revision not in use, because this root page is immediately copied when
we do an upadate.

That is for any update being done, when we don't have recent readers
(ie, if we do many updates, this algorithm works very well).

But what if we have more than one pending reader ? There is little we
can do here, we have to wait for one of the reader to be released,
because this is where we will be able to free many pages.

Ok, let's see what happens when we free a reader revision now.

We know which pages this reader is holding, it's the list of copied
pages by the update which occured at the same revision. When we release
this reader, all those copied pages has to bec checked to see if they
are still in use.


     M                   P       Q    R
     |                   |       |    |
     v                   v       v    v
o---[M]---[c]----[d]----[P]-----[Q]--[n]------> time
     |     ^      ^      |      /    /|
     |     |      |      |     /    / |
     |     |      .------|----.    /  +--> we create a new revision,
which copy page [c] and create page [n]
     |     |             |        /
     |     .-------------|-------.         
     |                   |
     |                   +--------------> Revision P is using page [c]
     |
     +---------------> Revision M
    
Here, we will release the reader P. We should expect that every page
copied between M and P can now be released (in the graphic, revison Q
has copied page [d], which is still in hold, because revision P is using
it). So when we release P, we know we should free pages [c] and [d]. The
difficulty here is to list all the pages that are in hold by revision P,
and thi sis a costly process :
- check every revision > P which have copied pages which revision is
above M and below P.

In order to do that we must have a data structure that holds all the
reader in use, order from the oldest to the newest, and we must have a
data structure keeping a track of all the pages copied by each revision.
We have both, but the second on is not in memory (well, we can't assume
it's in memory, as this data structure is stored in a B-tree itself).

So basically, it's a matter of browsing the copied-pages B-tree looking
for all the revisions between [M] and [R], and for each of those
revisions, we can feee of all the copied pages which revision is > M.
Not exactly inexpensive, but still acceptable, IMO.

Wdyt ?

Emmanuel Lécharny

Reply via email to