Hi all,
I've been playing around with a file format for skiplist2. I've mostly
nailed down the details now, but I want an opinion:
Do you see a case where the alignment of the "const char *" for value is
significant?
The answer to this will determine the amount of padding required per record.
Now - first the justifications:
a) 64 bit pointers, https://bugzilla.cyrusimap.org/show_bug.cgi?id=3494
b) reduce IO for random searches
c) checksums to detect file corruption
And a constraint: don't waste too much space per record.
So the first thing is to consider which fields don't really need to take 32
bits. I did chop up the header slightly, but then added more space back by
allowing the file to be up to 24 skip levels deep due to the new 64 bit
nature.
/* offsets of header fields */
enum {
    OFFSET_HEADER = 0,
    OFFSET_VERSION = 16,
    OFFSET_VERSION_MINOR = 18,
    OFFSET_NUM_RECORDS = 20,
    OFFSET_LOGSTART = 24,
    OFFSET_LASTRECOVERY = 32,
    OFFSET_FLAGS = 40,
    OFFSET_CRC32 = 44
};
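For illustration, parsing that header might look like the sketch below.
The struct name, field names and the my_ntohll helper are my own
assumptions (Cyrus carries its own 64-bit byte-swap macro); only the
offsets come from the enum above.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* assumed 64-bit network-to-host helper */
static uint64_t my_ntohll(uint64_t n)
{
    const uint64_t one = 1;
    if (*(const char *)&one)  /* little-endian host: swap halves */
        return ((uint64_t)ntohl((uint32_t)(n & 0xffffffffu)) << 32)
             | ntohl((uint32_t)(n >> 32));
    return n;
}

struct sl2_header {
    uint16_t version, version_minor;
    uint32_t num_records;
    uint64_t logstart, lastrecovery;
    uint32_t flags, crc32;
};

/* read the 48-byte header using the offsets above; memcpy avoids
 * unaligned-access traps on strict architectures */
static void parse_header(const unsigned char *base, struct sl2_header *h)
{
    uint16_t v16; uint32_t v32; uint64_t v64;
    memcpy(&v16, base + 16, 2); h->version       = ntohs(v16);
    memcpy(&v16, base + 18, 2); h->version_minor = ntohs(v16);
    memcpy(&v32, base + 20, 4); h->num_records   = ntohl(v32);
    memcpy(&v64, base + 24, 8); h->logstart      = my_ntohll(v64);
    memcpy(&v64, base + 32, 8); h->lastrecovery  = my_ntohll(v64);
    memcpy(&v32, base + 40, 4); h->flags         = ntohl(v32);
    memcpy(&v32, base + 44, 4); h->crc32         = ntohl(v32);
}
```

Note the header comes out to 48 bytes total with the CRC32 at the end.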
So VERSION and VERSION_MINOR get 16 bits each (currently 2 and 1
respectively; previously 1 and 2).
There's still only space for 32 bits worth of records, so we're limited
to 4 billion records per skiplist. I can live with that.
Timestamp and logstart are 64 bit now.
There's space to keep some "flags" - I'm not sure what to put there yet,
but one thing I was thinking of was "sort by level before key on repack",
which would make random key searches more linear, but make "foreach"
more random. Oh, and the magic got 4 characters shorter.
OK - now the fields themselves. First the "type"s.
#define DUMMY 0
#define COMMIT 1
#define ADD 2
#define DELETE 4
/* delete + add */
#define REPLACE 6
These go into 8 bits. REPLACE is an ADD with an extra field for the
offset being deleted.
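One nice consequence of the numbering: since ADD is 2 and DELETE is 4,
REPLACE (6) is literally their bitwise OR, so code can test for the
presence of a delete offset with a single bit test rather than comparing
against two types. A sketch of that idea (not necessarily the planned
code):

```c
#define DUMMY   0
#define COMMIT  1
#define ADD     2
#define DELETE  4
#define REPLACE (ADD | DELETE)   /* == 6: delete + add */

/* does this record carry a delete offset? */
static int has_deloffset(int type)
{
    return (type & DELETE) != 0;
}
```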
The header of a record is 64 bits like so:
/* split this into pieces */
record->vallen = ntohl(*((uint32_t *)(db->map_base + record->offset)));
record->keylen = ntohs(*((uint16_t *)(db->map_base + record->offset + 4)));
record->level  = *((char *)(db->map_base + record->offset + 6));
record->type   = *((char *)(db->map_base + record->offset + 7));
But with overflow checks to allow larger values and keys:
/* check for value overflow */
if (record->vallen == UINT32_MAX) {
    if (offset + record->len + 8 > db->map_size)
        goto badsize;
    record->vallen = ntohtll(*((uint64_t *)(db->map_base +
                                record->offset + record->len)));
    record->len += 8;
}

/* check for key overflow */
if (record->keylen == UINT16_MAX) {
    if (offset + record->len + 8 > db->map_size)
        goto badsize;
    record->keylen = ntohtll(*((uint64_t *)(db->map_base +
                                record->offset + record->len)));
    record->len += 8;
}
Then comes the DELETE pointer, if it's a delete record:
/* check for delete pointer */
if (record->type == DELETE || record->type == REPLACE) {
    if (offset + record->len + 8 > db->map_size)
        goto badsize;
    record->deloffset = ntohtll(*((uint64_t *)(db->map_base +
                                  record->offset + record->len)));
    record->len += 8;
}
And the skip pointers are BEFORE the key and value:
for (i = 0; i < record->level; i++) {
    if (offset + record->len + 8 > db->map_size)
        goto badsize;
    record->offsets[i] = ntohtll(*((uint64_t *)(db->map_base +
                                   record->offset + record->len)));
    record->len += 8;
}
What this means is that you can follow skip pointers without parsing
past the key and value. Indeed, if you have a really long key and
you're doing strcmp() you may never need to read any more than the first
few characters of the key, so the following pages won't ever be mapped
in if this is not the key you are looking for.
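In skiplist terms that search is the classic descend-by-level loop. A
minimal in-memory sketch below shows the access pattern; the real code
would chase the on-disk offsets through the mmap rather than C pointers,
and all the names here are mine:

```c
#include <string.h>

#define MAXLEVEL 24

/* simplified in-memory record for illustration only */
struct rec {
    const char *key;
    struct rec *forward[MAXLEVEL];  /* stands in for the on-disk offsets */
    int level;
};

/* Classic skiplist lookup: at each step only the record header, its
 * pointers and the start of its key are touched, never the value -
 * which is why keeping pointers before key+value saves page faults. */
static struct rec *sl_find(struct rec *dummy, const char *key)
{
    struct rec *cur = dummy;
    for (int i = dummy->level - 1; i >= 0; i--) {
        while (cur->forward[i] &&
               strcmp(cur->forward[i]->key, key) < 0)
            cur = cur->forward[i];
    }
    cur = cur->forward[0];
    return (cur && strcmp(cur->key, key) == 0) ? cur : NULL;
}
```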
Finally the key and value are accessed:
record->key = db->map_base + record->offset + record->len;
record->len += record->keylen;  /* XXX: unaligned value - is the tradeoff good? */
record->val = db->map_base + record->offset + record->len;
record->len += record->vallen;
And the entire record is rounded up to a multiple of 8 bytes.
/* round up the final figure */
record->len = roundup(record->len);

/* and make sure the whole rest fits */
if (offset + record->len > db->map_size)
    goto badsize;
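roundup() here presumably rounds up to the next multiple of 8; the usual
bit-trick definition would be (my assumption of its behaviour):

```c
/* assumed definition: round n up to the next multiple of 8 */
#define ROUNDUP8(n) (((n) + 7UL) & ~7UL)
```

So ROUNDUP8(17) gives 24, and values already on an 8-byte boundary pass
through unchanged.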
You will notice that all this code reads fields into a structure rather
than handing out direct pointer accesses, and carefully checks at every
point that the record fits within the mmap. This is basically a
copy-and-paste of read_record() from my current dev tree.
So far, this is space-wise equivalent to the old format on average. The
old format was:
chars    field
4        TYPE
4        KEYLEN
%4       KEY
4        VALLEN
%4       VAL
4*level  PTR
4        -1 (pointer list finished)
Now PROB is 0.5, so the "average" record will have two pointers, so that
is a total overhead of 6*4 == 24 bytes per record, plus two 4 byte
roundings.
The new format so far is:
8        HEADER
8*level  PTR
%8       KEY+VAL
With two ptrs, that is 3*8 == 24 bytes! plus a single 8 byte rounding.
So the average size is the same. I am planning to add another 8 bytes
however, as two CRC32s: one over the HEADER and PTR part, and a
separate one over the key and value. This is so that the first CRC32
can be re-calculated on every rewrite, and the second only when the
record is created. So the final format would be:
8        HEADER
8        VALLENEXT?
8        KEYLENEXT?
8        DELPTR?
8*level  NEXTPTR
4        CRC32_HEAD
4        CRC32_VAL
%8       key+val
Or if people decide we need two roundings, then we pad the keylen out as
well, for an average extra cost of 4 bytes per record.
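The two-CRC split could be computed along these lines. The function
names and the plain bitwise CRC-32 below are mine (a real implementation
would use zlib's table-driven crc32(), which uses the same polynomial):

```c
#include <stddef.h>
#include <stdint.h>

/* plain bitwise CRC-32, reflected, polynomial 0xEDB88320 */
static uint32_t crc32_calc(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xffffffffu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xedb88320u & (uint32_t)-(int)(crc & 1));
    }
    return ~crc;
}

/* One checksum over HEADER..NEXTPTR, recomputed whenever the skip
 * pointers are rewritten ... */
static uint32_t crc_head(const unsigned char *rec, size_t headlen)
{
    return crc32_calc(rec, headlen);
}

/* ... and one over key+value, computed once when the record is
 * created and never touched again. */
static uint32_t crc_keyval(const unsigned char *keyval, size_t len)
{
    return crc32_calc(keyval, len);
}
```

The payoff of the split is that pointer rewrites during normal operation
only churn the small head checksum, while the (potentially large)
key+value area keeps its original checksum for its whole life.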
Can anyone see any glaring stupidities in this?
Oh yeah - "dummy" is just a record like everything else, it's the first
one. The "level" counter for the database is of course now just the
"level" field in the dummy record.
And a commit record is, on disk, 8 bytes: \0\0\0\0\0\0\0\1 - unlikely
to occur randomly in data!
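That byte pattern falls straight out of the 64-bit record header:
vallen 0, keylen 0, level 0, and type COMMIT (== 1) in the last byte.
A quick sketch (helper name is mine):

```c
#include <string.h>

/* Build a commit record header: 32-bit vallen, 16-bit keylen,
 * 8-bit level, 8-bit type, all zero except the type byte. */
static void make_commit(unsigned char out[8])
{
    memset(out, 0, 8);   /* vallen = 0, keylen = 0, level = 0 */
    out[7] = 1;          /* type = COMMIT */
}
```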
Bron.