Page edited by Emmanuel Lécharny

One of the major burdens and costs of managing a BTree efficiently is writing it to disk. Making the in-memory operations on a BTree as fast as possible matters comparatively little, considering that writing to a physical medium is roughly three orders of magnitude slower than writing to memory.

Physical support

There are other aspects we should consider when we think about IO: the data will be written to a physical device, and performance may vary a lot depending on the kind of device we are using. We basically have three kinds of physical support:

spinning hard disks
SSD
NAS

Spinning hard disk

There are many different disks, but we will consider the rotation speed to be one of the most important factors. Rotation speed dictates the rotational latency (the time it takes for the right sector to come under the head once the head is in position). The following table shows the average rotational latency, which is the time of half a revolution, for various rotation speeds:

Speed	Average latency
5400 rpm	5.6 ms
7200 rpm	4.2 ms
10000 rpm	3.0 ms
15000 rpm	2.0 ms

Add to this latency the seek time (the time it takes to move the head to the right track) and the time it takes to transfer the data from the disk to memory.
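As a quick check of the figures above, the average rotational latency can be derived from the rotation speed alone: one revolution takes 60000 / rpm milliseconds, and on average the target sector is half a revolution away. The function name below is only illustrative.

```python
def average_rotational_latency_ms(rpm: float) -> float:
    """Average wait, in milliseconds, for a sector to rotate under the head."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    ms_per_revolution = 60_000.0 / rpm
    # On average the target sector is half a revolution away from the head.
    return ms_per_revolution / 2.0


for rpm in (5400, 7200, 10000, 15000):
    print(f"{rpm:>6} rpm: {average_rotational_latency_ms(rpm):.2f} ms")
```

The results are close to the values in the table, which are rounded.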

In any case, it is enough to know that in order to improve the performance of a BTree, we should minimize disk IO and keep the data contiguous on disk to avoid time-consuming seek operations.
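To get a feeling for why contiguity matters, here is a rough cost model, not a benchmark. It assumes that every scattered page pays one seek plus one rotational latency, while a contiguous run pays them only once. All the numbers in the example (page size, seek time, transfer rate) are assumptions chosen for illustration and do not describe any particular disk.

```python
def read_cost_ms(pages, page_size_bytes, seek_ms, rotational_ms,
                 transfer_mb_per_s, contiguous):
    """Estimate the time to read `pages` pages from a spinning disk."""
    total_bytes = pages * page_size_bytes
    transfer_ms = total_bytes / (transfer_mb_per_s * 1_000_000) * 1000.0
    positioning_ms = seek_ms + rotational_ms
    if contiguous:
        # The head is positioned once, then the pages stream off the platter.
        return positioning_ms + transfer_ms
    # Each scattered page needs its own seek and rotational wait.
    return pages * positioning_ms + transfer_ms


# Assumed figures: 1000 pages of 4096 bytes, 9 ms seek,
# 4.2 ms rotational latency (7200 rpm), 100 MB/s transfer rate.
scattered = read_cost_ms(1000, 4096, 9.0, 4.2, 100.0, contiguous=False)
sequential = read_cost_ms(1000, 4096, 9.0, 4.2, 100.0, contiguous=True)
print(f"scattered: {scattered:.0f} ms, contiguous: {sequential:.0f} ms")
```

Under these assumptions the positioning time dominates the scattered case by far, which is the reason for keeping BTree pages next to each other on disk.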

SSD

SSDs work in a totally different way, and one other factor has to be taken into account: writing data to an SSD wears it out in the long run, and when we modify something on an SSD, a whole block gets rewritten, not just a page (a block can be quite big, something like 2 MB). So if we can defer the write until we have enough data to fill a block, that would be better.
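Deferring writes until a block is full could look like the following minimal sketch. The class name, the `device` parameter (any object with a `write(bytes)` method) and the 2 MB default are assumptions made for the example; a real implementation would also have to align the writes with the device's actual block boundaries.

```python
import io


class BlockBufferedWriter:
    """Accumulates small writes and only sends full blocks to the device."""

    def __init__(self, device, block_size=2 * 1024 * 1024):
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        self._device = device
        self._block_size = block_size
        self._pending = bytearray()
        self.blocks_written = 0

    def write(self, data: bytes) -> None:
        self._pending.extend(data)
        # Emit as many complete blocks as the pending data allows.
        while len(self._pending) >= self._block_size:
            self._emit(self._block_size)

    def flush(self) -> None:
        """Force out a partial block, for instance at shutdown."""
        if self._pending:
            self._emit(len(self._pending))

    def _emit(self, size: int) -> None:
        self._device.write(bytes(self._pending[:size]))
        del self._pending[:size]
        self.blocks_written += 1


# Usage with an in-memory stand-in for the device and a tiny block size.
device = io.BytesIO()
writer = BlockBufferedWriter(device, block_size=8)
writer.write(b"abc")        # buffered, nothing reaches the device yet
writer.write(b"defghij")    # one full block of 8 bytes is written
writer.flush()              # the remaining 2 bytes are forced out
```

The price of this approach is that data sitting in the buffer is lost if the process stops before the block is written.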

NAS

We could also think about a solution where the data are pushed to a NAS, which will have yet another performance profile.

Relation between the in-memory B+Tree and physical support

An in-memory MVCC BTree is a good thing, but at some point we are limited by two factors:

the memory size is limited
we want to be able to get back the data when the process is stopped and restarted.

There are two ways to mitigate those constraints:

the in-memory data are backed up on disk, but we keep the whole BTree in memory
the BTree data are written back to disk as soon as we modify the BTree

We will describe the two strategies in the next paragraphs.

Persistent In-Memory BTree

The idea is to keep the whole BTree in memory, while saving the newly added or removed data on disk, so that we can reload the BTree at startup. This is done using a journal which is flushed to disk periodically by a separate thread. We may still lose the changes made since the last flush, but once the journal is written to disk, we can restore the BTree from what we have on disk.

To keep it simple, we never modify an existing data file: we create a new one with the current tree content. When we load this file into the in-memory BTree, we then have to apply the journal to get back to the same state as when the process was stopped.
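The scheme described above can be sketched as follows. This is only an illustration under several assumptions: a plain dict stands in for the in-memory BTree, keys are strings, the file names and the JSON encoding are invented for the example, and the journal flush is called explicitly instead of from a separate thread.

```python
import json
import os
import tempfile


class PersistentTree:
    """A dict stands in for the BTree; the persistence scheme is the point."""

    def __init__(self, directory):
        self._snapshot_path = os.path.join(directory, "tree.snapshot")
        self._journal_path = os.path.join(directory, "tree.journal")
        self._data = {}
        self._unflushed = []  # journal entries not yet written to disk
        self._recover()

    def put(self, key, value):
        self._data[key] = value
        self._unflushed.append({"op": "put", "key": key, "value": value})

    def remove(self, key):
        self._data.pop(key, None)
        self._unflushed.append({"op": "remove", "key": key})

    def get(self, key):
        return self._data.get(key)

    def flush_journal(self):
        """Append the pending entries to the journal file."""
        with open(self._journal_path, "a", encoding="utf-8") as journal:
            for entry in self._unflushed:
                journal.write(json.dumps(entry) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        self._unflushed.clear()

    def snapshot(self):
        """Write the current content to a new file, then reset the journal."""
        self.flush_journal()
        new_path = self._snapshot_path + ".new"
        with open(new_path, "w", encoding="utf-8") as out:
            json.dump(self._data, out)
            out.flush()
            os.fsync(out.fileno())
        os.replace(new_path, self._snapshot_path)
        # Replaying an old journal over the new file would be harmless here,
        # because put and remove give the same result when applied twice.
        open(self._journal_path, "w", encoding="utf-8").close()

    def _recover(self):
        """Load the last full file, then apply the journal on top of it."""
        if os.path.exists(self._snapshot_path):
            with open(self._snapshot_path, encoding="utf-8") as snap:
                self._data = json.load(snap)
        if os.path.exists(self._journal_path):
            with open(self._journal_path, encoding="utf-8") as journal:
                for line in journal:
                    entry = json.loads(line)
                    if entry["op"] == "put":
                        self._data[entry["key"]] = entry["value"]
                    else:
                        self._data.pop(entry["key"], None)


with tempfile.TemporaryDirectory() as directory:
    tree = PersistentTree(directory)
    tree.put("a", 1)
    tree.snapshot()
    tree.put("b", 2)
    tree.flush_journal()
    tree.put("c", 3)                       # never flushed, so it is lost
    restarted = PersistentTree(directory)  # simulates a process restart
    print(restarted.get("a"), restarted.get("b"), restarted.get("c"))
```

After the simulated restart, the entries that reached the full file or the journal are back, while the change that was never flushed is gone.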


[CONF] Apache Labs > IO operations

IO operations