I've been thinking about this one for a very long time. It's still not going to 
be as excellent as zeroskip, but it has the advantage that it's still a single 
file, and it fixes two of the three defects of twoskip.

Those three defects are:
1) poor write locality - updates happen throughout the file
2) long repack freeze gives unpredictable performance
3) no MVCC reads - writers block all readers.

Threeskip will fix two of those! Numbers 2 and 3.

For comparison, here's the twoskip ondisk format:

* HEADER: 64 bytes
* magic: 20 bytes: "4 bytes same as skiplist" "twoskip file\0\0\0\0"
* version: 4 bytes
* generation: 8 bytes
* num_records: 8 bytes
* repack_size: 8 bytes
* current_size: 8 bytes
* flags: 4 bytes
* crc32: 4 bytes
*
* RECORDS:
* type 1 byte
* level: 1 byte
* keylen: 2 bytes
* vallen: 4 bytes
* <optionally: 64 bit keylen if keylen == UINT16_MAX>
* <optionally: 64 bit vallen if vallen == UINT32_MAX>
* ptrs: 8 bytes * (level+1)
* crc32_head: 4 bytes
* crc32_tail: 4 bytes
* key: (keylen bytes)
* val: (vallen bytes)
* padding: enough zeros to round up to an 8 byte multiple
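For concreteness, the 64-byte header above packs down like this (a sketch; byte order, the crc32 covering the first 60 bytes, and the `"????"` stand-in for the 4 skiplist magic bytes are my assumptions, not from the post):

```python
import struct
import zlib

# Sketch of the 64-byte twoskip header described above.
# Assumptions (mine): network byte order; crc32 covers the first 60 bytes.
MAGIC = b"????" + b"twoskip file\x00\x00\x00\x00"  # "????" stands in for the 4 skiplist bytes

HEADER_FMT = ">20sIQQQQI"  # magic, version, generation, num_records,
                           # repack_size, current_size, flags (60 bytes)

def pack_header(version, generation, num_records, repack_size,
                current_size, flags):
    body = struct.pack(HEADER_FMT, MAGIC, version, generation,
                       num_records, repack_size, current_size, flags)
    # Append the crc32 of everything before it, giving 64 bytes total.
    return body + struct.pack(">I", zlib.crc32(body) & 0xffffffff)
```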


Threeskip is almost exactly the same, but with three record types (and of course 
with 'threeskip file\0\0' in the header):

RECORD
NEWVALUE
DELETE

RECORD:
* type 1 byte
* level: 1 byte
* keylen: 2 bytes
* vallen: 4 bytes
* <optionally: 64 bit keylen if keylen == UINT16_MAX>
* <optionally: 64 bit vallen if vallen == UINT32_MAX>
* latestptr: 8 bytes (new)
* ptrs: 8 bytes * (level+1)
* crc32_head: 4 bytes
* crc32_tail: 4 bytes
* key: (keylen bytes)
* val: (vallen bytes)
* padding: enough zeros to round up to an 8 byte multiple

NEWVALUE:
* type: 1 byte
* padding: 3 bytes
* vallen: 4 bytes
* <optionally: 64 bit vallen if vallen == UINT32_MAX>
* prevptr: 8 bytes
* crc32_head: 4 bytes
* crc32_tail: 4 bytes
* val: (vallen bytes)
* padding: enough zeros to round up to an 8 byte multiple +1

DELETE:
* type: 1 byte
* ZEROS: 7 bytes
* prevptr: 8 bytes
* crc32_head: 4 bytes
* crc32_tail: 4 bytes (ZERO)
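As a sketch of how the two new record types might serialize (the type tag values, byte order, and exact crc coverage -- crc32_head over the fixed fields, crc32_tail over the value -- are my assumptions; the optional 64-bit length extensions are omitted):

```python
import struct
import zlib

# Placeholder type tags; the post doesn't assign numbers.
T_RECORD, T_NEWVALUE, T_DELETE = 1, 2, 3

def _crc(data):
    return zlib.crc32(data) & 0xffffffff

def _pad8(n):
    return (8 - n % 8) % 8

def pack_delete(prevptr):
    # type(1) + zeros(7) + prevptr(8) + crc32_head(4) + crc32_tail(4, zero)
    head = struct.pack(">B7xQ", T_DELETE, prevptr)
    return head + struct.pack(">II", _crc(head), 0)

def pack_newvalue(prevptr, val):
    # type(1) + padding(3) + vallen(4) + prevptr(8) + crcs(8) + val + padding
    head = struct.pack(">B3xIQ", T_NEWVALUE, len(val), prevptr)
    tail = val + b"\x00" * _pad8(len(val))
    return head + struct.pack(">II", _crc(head), _crc(tail)) + tail
```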

This gives us MVCC! When you start a read transaction, the transaction includes 
the length of the file at the time of transaction start. The reader holds the 
same filehandle open, meaning it can keep reading even after a repack. If it 
reads past the transaction length, it just moves on to the next record. If it 
follows the latestptr past the transaction length, it just follows the prevptr 
back until it's at the latest value within the transaction.
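The snapshot rule above can be modelled in a few lines: trust latestptr only if it lands inside the snapshot, otherwise walk prevptr back until one does (a toy in-memory model, not real file I/O):

```python
# Model: `versions` maps offset -> (prevptr, value), with prevptr 0
# meaning "no older version".  Offsets grow append-only, so "inside
# the snapshot" is simply offset <= snapshot_len.

def read_at_snapshot(versions, latestptr, snapshot_len):
    off = latestptr
    while off and off > snapshot_len:
        off = versions[off][0]      # newer than our snapshot: step back
    if not off:
        return None                 # key did not exist at snapshot time
    return versions[off][1]
```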

(In twoskip, DELETE already had a prevptr, but it was less powerful.)

*DUAL LOCKING*

With this comes a second MVCC helper. We use two different write locks, which 
can be achieved with byte-range locks (e.g. fcntl) within the header. Lock 
number 1 is held for the entire transaction and excludes any other writer. Lock 
number 2 is held ONLY for the duration of writing a single record and updating 
the related pointers. Readers take a shared lock on number 2 only, which 
guarantees them clean reads.
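One way to carve two independent locks out of a single file is POSIX byte-range locks over two distinct header bytes (a sketch; the offsets 0 and 1 are my choice for illustration):

```python
import fcntl

TXN_LOCK = 0      # lock 1: exclusive, held for the whole write transaction
RECORD_LOCK = 1   # lock 2: exclusive per record write; shared for readers

def lock(fd, offset, shared=False, block=True):
    op = fcntl.LOCK_SH if shared else fcntl.LOCK_EX
    if not block:
        op |= fcntl.LOCK_NB
    fcntl.lockf(fd, op, 1, offset)  # lock exactly one byte at `offset`

def unlock(fd, offset):
    fcntl.lockf(fd, fcntl.LOCK_UN, 1, offset)
```

A writer takes lock 1 for the transaction and briefly lock 2 around each record; a reader takes only a shared lock 2, so it never waits out a whole write transaction.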

*NON-BLOCKING REPACK*

The repack process can take an MVCC read and stream those records to a new file 
$DB.new (on which it holds an exclusive lock to avoid racing repacks), and then 
it can replay the log from there. Even the log replay can be done under read 
locks while the remaining log is over a certain length (though this can lead to 
starvation, so it should be bounded to a limited number of passes). The final 
fsync and rename need to be done under a write lock on the source file, of course.

Reading the log, if we hit a NEWVALUE or DELETE, we follow the prevptr chain 
back to the owning RECORD to find the key it references.
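That key lookup can be modelled like so (toy records, simplified shapes: only a RECORD carries the key, so we chase prevptrs until we reach one):

```python
# Model log records, keyed by offset: ("RECORD", key, value),
# ("NEWVALUE", prevptr, value), or ("DELETE", prevptr, None).

def resolve_key(records, offset):
    while records[offset][0] != "RECORD":
        offset = records[offset][1]   # follow prevptr toward the RECORD
    return records[offset][1]         # the key is stored on the RECORD itself
```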

It might be necessary to have a "repackd" which does the actual repacking, or 
some other way for a process to say "I don't mind waiting on the repack", 
because otherwise the repack freeze will still be visible to the current 
client, even if other readers and writers can continue.

*STORAGE COST COMPARISON*

This adds an extra 8 bytes of overhead to every RECORD. The DELETE record is 
the same size. The NEWVALUE record no longer carries the cost of an extra copy 
of the key. So it will use slightly more storage on newly repacked databases.

(The base-level average overhead of a twoskip record is 40 bytes; this makes it 
48.)

*COMPUTATIONAL AND IO COST COMPARISON*

Calculating "repack_size" will be a little more costly, but not too bad.

Replacing values will be much cheaper because we don't need to update pointers. 
In theory we could even "row lock" or "gap lock" if we wanted other concurrency 
levels.

*CRC ALGORITHM*

Since we have a choice with a new format, clearly we'd go with something fast! 
CRC32c with hardware acceleration all the way.
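For reference, CRC32c is the Castagnoli polynomial (reflected form 0x82F63B78); a tiny bitwise software version is handy for checking a hardware-accelerated implementation against, since the standard check value for b"123456789" is 0xE3069283:

```python
def crc32c(data, crc=0):
    # Bitwise CRC-32C (Castagnoli): init/final XOR 0xFFFFFFFF,
    # LSB-first with reflected polynomial 0x82F63B78.
    crc ^= 0xffffffff
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xffffffff
```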

*CRASH SAFETY AND THE OTHER TWOSKIP BENEFITS*

All still there just as much as ever. Of course, write locality is helped 
slightly (overwrite and delete only touch the current record) but it's still 
the issue that will make zeroskip a better DB format.

Bron.

--
 Bron Gondwana, CEO, Fastmail Pty Ltd
 br...@fastmailteam.com
