Hi guys, we have had long discussions with Howard on LMDB about the release of pages that are not anymore in use. This is a critical part of a MVCC based B-tree, as each update will copy many pages, and we want to re-use the copied pages.
The problem is that those copied pages may be still in use by older readers. Let's quickly explain how it all works. At Ti, we are updating the B-tree, and that creates many pages by copying pages from previous revisions. We will note those new pages based on the level in the B-tree they were stored in : P[0, i-1] will be for the root page, on level 0, and this page previous version was k (and as it's the root page, k is obviously one version earlier than the current revision). P[k, a] will be for a page at level k, and with a previous revision 'a' (where a <= i-1). One more important point is that a copied page may be split in two new pages (if teh previous page was full), as we may copy the content of two old pages into one single new page (when the two old pages contain only N/2 elemnts and we removed one element in one of the two pages). Last, not least, we may have to remove one level, if the root page is removed. All in all, what is important here is to *know* which pages have been copied during the update operation, because this is those pages that will be removed at some point. Now, assuming we keep a track of all the copied pages for a revision R, we can also assume we can remove all those pages when this revision R is not anymore in use (ie no reader thread is using R *and* no other reader thread is using a revision below R). This is what LMDB does, AFAICT. Can we go a bit further, and get rid of copied pages when a reader is using a revision < R ? I believe so. Let say that for a revision R, we have copied a page P[k, c], and let say we have a reader using a revision P < c (obviously, if a reader uses a revision > c, we can't free P[k, c], because it's fully part of the B-tree used by reader P). If c > P, that means we have copied a page which revison was < P at some point, so we can now get rid of this copy, no other reader will use it. case 1 : P is newer than the copied page : P R | | v v o--------[c]---------[P]---------[n]------> time ^ | /| | | / +--> we create a new revision, which copy page [c] and create page [n] .-----------|--------. | +--------------> Revision P is using page [c] Here, we can't get rid of page [c], as it's still part of revision P B-tree case 2 : P is older than the copied page P Q R | | | v v v o--------[P]---------[c]---------[n]------> time | ^ /| | | / +--> we create a new revision, which copy page [c] and create page [n] | .--------. | +--------------> Revision P Here, an update at revision Q has created a page [c] which has been copied by the update at revision R. We can get rid of page [c] because no other reader will ever use it. That is all clear and easy. Now, the real pb is how do we determinate the pages we can free based on the copied pages at revision R ? At revision R, we know which pages we have copied, and each page has a unique revision number. That allows us to know which pages to free quite immediately : we just free all the pages which revision is newer than the most recent reader revision. For instance, we can immediately get rid of all the root pages for any revision not in use, because this root page is immediately copied when we do an upadate. That is for any update being done, when we don't have recent readers (ie, if we do many updates, this algorithm works very well). But what if we have more than one pending reader ? There is little we can do here, we have to wait for one of the reader to be released, because this is where we will be able to free many pages. Ok, let's see what happens when we free a reader revision now. We know which pages this reader is holding, it's the list of copied pages by the update which occured at the same revision. When we release this reader, all those copied pages has to bec checked to see if they are still in use. M P Q R | | | | v v v v o---[M]---[c]----[d]----[P]-----[Q]--[n]------> time | ^ ^ | / /| | | | | / / | | | .------|----. / +--> we create a new revision, which copy page [c] and create page [n] | | | / | .-------------|-------. | | | +--------------> Revision P is using page [c] | +---------------> Revision M Here, we will release the reader P. We should expect that every page copied between M and P can now be released (in the graphic, revison Q has copied page [d], which is still in hold, because revision P is using it). So when we release P, we know we should free pages [c] and [d]. The difficulty here is to list all the pages that are in hold by revision P, and thi sis a costly process : - check every revision > P which have copied pages which revision is above M and below P. In order to do that we must have a data structure that holds all the reader in use, order from the oldest to the newest, and we must have a data structure keeping a track of all the pages copied by each revision. We have both, but the second on is not in memory (well, we can't assume it's in memory, as this data structure is stored in a B-tree itself). So basically, it's a matter of browsing the copied-pages B-tree looking for all the revisions between [M] and [R], and for each of those revisions, we can feee of all the copied pages which revision is > M. Not exactly inexpensive, but still acceptable, IMO. Wdyt ? Emmanuel Lécharny