Page edited by Emmanuel Lécharny

Changes (2)

Full Content

So we want MVCC implemented on top of a B+Tree. How do we do that ? Let's write down a piece of specifications first.

How does it all work ?

Simple. We do what we call CoW, or Copy on Write. In other words, when we want to update the B+tree, either by adding, modifying or removing elements in it, we don't modifying existing pages, but we copy the modified pages to create new ones. Let's see how it works with some diagrams.

We will use a B+Tree with pages containing only 2 elements. That means a Node can have 3 children at most. (we could have used a larger page size, but for the demonstration, it's easier to have only 2 elements per page). This B+Tree will use single letters as keys, and a value named v{X} which is not interesting atm. We will inject the following set of data : {<E, vE>, <A, vA>, <D, vD>, <I, vI>, <G, vG>, <K, vK>, <H, vH>, <B, vB>} (in this order)

1) Insertion of <E, vE>

This is easy : we create a root page, and we inject the tuple into it. The tree has the revision number 1.

2) Insertion of <A, vA>

Here, we will copy the root page, attribute a new revision, 2, and inject the tuple into the new page. The old page r1 still exists and s pointed internally by the BTree instance

3) Insertion of <D, vD>

Now, the first page is full, we need to split it and create a new level, by creating two leaves containing the three values. The root page now becomes a node, containing only links to the underlying leaves.

We can see that the three versions are still present, and that we have copied all the pages so far.

4) Insertion of <I, vI>

This time, we will only copy 2 pages, and we will point to the unchanged page :

The root node has been modified because its right leaf has been split in two pages, with the pivotal key being moved up to the root. The leaf containing the <A, vA> tuple has remained unchanged, so we still point on it.

This goes on and on like this, and the more tuples we add, the less pages are copied.

Some importants rules

In order to get the tree correctly balanced and easy to use, we have some more constraints to add.

A page contains N element max, N being a power of 2
A leaf must contain at least N/2 elements. If N is odd, then the page should contain (N/1)/2 elements.
The root page can contain only one element
Nodes should not contain any value, except if the node is the root page and has no children
A Node containing M elements has M+1 children
A leaf containing M keys has also M values
A leaf has a link to its sibling leaves
Each page has a revision number and a unique identifier

Nodes and Leaves

We manipulate pages, which may be either Nodes or Leaves. Nodes don't contain any value, while leaves contain values. The nodes are used to route the search down to the leaves.

There is one exception : when the root page is enough to store all the tuples we want to store. It all depends on the number of elements we can store in a single page.

Leaves are also linked all together, with pointers to the next leaf and to the previous leaf. This is important to be able to browse the data in ascendant or descendant order easily, without having to go up and down through the tree.

Spliting a page

When a page becomes full, we have to split it. In order to have all the values down in leaves, this has an impact on how we do split a page. In a simple Btree, it's easier : we get the keys in the page, insert the new key, and search for the pivotal key, which should be in the middle of the page (I will not take into consideration here the fact that a page contain an odd or even number of element). Every keys on the left of the pivotal element will be copied into a new page, and every keys on the right will be copied into a new page too. The pivotal element will then be moved up to the parent page, with the links to the two new parent pages.

So far, so good, except that we have two special cases to deal with.

The split page is the root page

In this case, the pivotal element will become the new root.

The parent page is already full

Adding the pivotal element into this full page will result in a split of this page. We will have to apply the very same algorithm up to the root page or until we have a paret page which is not full.

Merging a page

A deletion will leads to some tuple removal from a leaf. At some point, we may have to deal with a leaf containing less than N/2 elements. In this case, we will have to merge the page with one of its sibling, if the number of elements in the merged page does not exceed the maximum of elements in a page, or if this is not the case, we will have to move one element from one of its sibling up to the parent, and move the old pivotal value down to the current page.

The follow schemas expose the two different cases :

Moving an element from a sibling

Here, we remove an element in a page which contains N/2 elements. The page does not contain enough element, so we grab one element from one of its sibling (here, the left sibling contains N+1 element), and we move it up to the parent, and move the old pivotal value down on the modified page.

Merging two pages in one

Here, the sibling also has N/2 elements, so we can merge the two pages. We will remove the old pivotal element, as it's already present in the right page.

Multiple updates

In some cases, we may want to do multiple modifications within one single revision. In order to allow that, we must expose the revision to the user, so that the modifications are applied to the copied pages, instead of creating a new revision for each new modification, which is expensive.

The idea is to duplicate old pages when needed, but work on the created pages if we have some to update.

When the user is done, he just have to close the revision. We can do that using a mechanism that mimic transactions.

Change Notification Preferences

View Online | View Changes | Add Comment

--------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]

[CONF] Apache Labs > MVCC specifications

MVCC specifications