Hi all,
I've been playing around with a file format for skiplist2. I've mostly
nailed down the details now, but I want an opinion:
Do you see a case where the alignment of the "const char *" for value is
significant?
The answer to this will determine the amount of padding required per record.
Now - first the justifications:
a) 64 bit pointers, https://bugzilla.cyrusimap.org/show_bug.cgi?id=3494
b) reduce IO for random searches
c) checksums to detect file corruption
And a constraint: don't waste too much space per record.
So the first thing is to consider which fields don't really need to take 32
bits. I did chop up the header slightly, but then added more space back by
allowing the file to be up to 24 skip levels deep due to the new 64 bit
nature.
/* offsets of header fields */
enum {
    OFFSET_HEADER = 0,
    OFFSET_VERSION = 16,
    OFFSET_VERSION_MINOR = 18,
    OFFSET_NUM_RECORDS = 20,
    OFFSET_LOGSTART = 24,
    OFFSET_LASTRECOVERY = 32,
    OFFSET_FLAGS = 40,
    OFFSET_CRC32 = 44
};
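For illustration, parsing that header might look like the sketch below.
The struct name, field names and the my_ntohll helper are my own
assumptions (Cyrus carries its own 64-bit byte-swap macro); only the
offsets come from the enum above.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* assumed 64-bit network-to-host helper */
static uint64_t my_ntohll(uint64_t n)
{
    const uint64_t one = 1;
    if (*(const char *)&one)  /* little-endian host: swap halves */
        return ((uint64_t)ntohl((uint32_t)(n & 0xffffffffu)) << 32)
             | ntohl((uint32_t)(n >> 32));
    return n;
}

struct sl2_header {
    uint16_t version, version_minor;
    uint32_t num_records;
    uint64_t logstart, lastrecovery;
    uint32_t flags, crc32;
};

/* read the 48-byte header using the offsets above; memcpy avoids
 * unaligned-access traps on strict architectures */
static void parse_header(const unsigned char *base, struct sl2_header *h)
{
    uint16_t v16; uint32_t v32; uint64_t v64;
    memcpy(&v16, base + 16, 2); h->version       = ntohs(v16);
    memcpy(&v16, base + 18, 2); h->version_minor = ntohs(v16);
    memcpy(&v32, base + 20, 4); h->num_records   = ntohl(v32);
    memcpy(&v64, base + 24, 8); h->logstart      = my_ntohll(v64);
    memcpy(&v64, base + 32, 8); h->lastrecovery  = my_ntohll(v64);
    memcpy(&v32, base + 40, 4); h->flags         = ntohl(v32);
    memcpy(&v32, base + 44, 4); h->crc32         = ntohl(v32);
}
```

Note the header comes out to 48 bytes total with the CRC32 at the end.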
So VERSION and VERSION_MINOR get 16 bits each (currently 2 and 1
respectively; previously 1 and 2).
There's still only space for 32 bits worth of records, so we're limited
to 4 billion records per skiplist. I can live with that.
Timestamp and logstart are 64 bit now.
There's space to keep some "flags" - I'm not sure what to put there yet,
but one thing I was thinking of was "sort by level before key on repack",
which would make random key searches more linear, but make "foreach"
more random. Oh, and the magic got 4 characters shorter.
OK - now the fields themselves. First the "type"s.
#define DUMMY 0
#define COMMIT 1
#define ADD 2
#define DELETE 4
/* delete + add */
#define REPLACE 6
These go into 8 bits. REPLACE is an ADD with an extra field for the
offset being deleted.
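One nice consequence of the numbering: since ADD is 2 and DELETE is 4,
REPLACE (6) is literally their bitwise OR, so code can test for the
presence of a delete offset with a single bit test rather than comparing
against two types. A sketch of that idea (not necessarily the planned
code):

```c
#define DUMMY   0
#define COMMIT  1
#define ADD     2
#define DELETE  4
#define REPLACE (ADD | DELETE)   /* == 6: delete + add */

/* does this record carry a delete offset? */
static int has_deloffset(int type)
{
    return (type & DELETE) != 0;
}
```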
The header of a record is 64 bits like so:
/* split this into pieces */
record->vallen = ntohl(*((uint32_t *)(db->map_base + record->offset)));
record->keylen = ntohs(*((uint16_t *)(db->map_base + record->offset + 4)));
record->level  = *((char *)(db->map_base + record->offset + 6));
record->type   = *((char *)(db->map_base + record->offset + 7));
But with overflow checks to allow larger values and keys:
/* check for value overflow */
if (record->vallen == UINT32_MAX) {
    if (offset + record->len + 8 > db->map_size)
        goto badsize;
    record->vallen = ntohtll(*((uint64_t *)(db->map_base +
                                record->offset + record->len)));
    record->len += 8;
}

/* check for key overflow */
if (record->keylen == UINT16_MAX) {
    if (offset + record->len + 8 > db->map_size)
        goto badsize;
    record->keylen = ntohtll(*((uint64_t *)(db->map_base +
                                record->offset + record->len)));
    record->len += 8;
}
Then comes the DELETE pointer, if it's a delete record:
/* check for delete pointer */
if (record->type == DELETE || record->type == REPLACE) {
    if (offset + record->len + 8 > db->map_size)
        goto badsize;
    record->deloffset = ntohtll(*((uint64_t *)(db->map_base +
                                  record->offset + record->len)));
    record->len += 8;
}
And the skip pointers are BEFORE the key and value:
for (i = 0; i < record->level; i++) {
    if (offset + record->len + 8 > db->map_size)
        goto badsize;
    record->offsets[i] = ntohtll(*((uint64_t *)(db->map_base +
                                   record->offset + record->len)));
    record->len += 8;
}
What this means is that you can follow skip pointers without parsing
past the key and value. Indeed, if you have a really long key and
you're doing strcmp() you may never need to read any more than the first
few characters of the key, so the following pages won't ever be mapped
in if this is not the key you are looking for.
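In skiplist terms that search is the classic descend-by-level loop. A
minimal in-memory sketch below shows the access pattern; the real code
would chase the on-disk offsets through the mmap rather than C pointers,
and all the names here are mine:

```c
#include <string.h>

#define MAXLEVEL 24

/* simplified in-memory record for illustration only */
struct rec {
    const char *key;
    struct rec *forward[MAXLEVEL];  /* stands in for the on-disk offsets */
    int level;
};

/* Classic skiplist lookup: at each step only the record header, its
 * pointers and the start of its key are touched, never the value -
 * which is why keeping pointers before key+value saves page faults. */
static struct rec *sl_find(struct rec *dummy, const char *key)
{
    struct rec *cur = dummy;
    for (int i = dummy->level - 1; i >= 0; i--) {
        while (cur->forward[i] &&
               strcmp(cur->forward[i]->key, key) < 0)
            cur = cur->forward[i];
    }
    cur = cur->forward[0];
    return (cur && strcmp(cur->key, key) == 0) ? cur : NULL;
}
```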
Finally the key and value are accessed:
record->key = db->map_base + record->offset + record->len;
record->len += record->keylen;  /* XXX: unaligned value - is the tradeoff good? */
record->val = db->map_base + record->offset + record->len;
record->len += record->vallen;
And the entire record is rounded up to a multiple of 8 bytes.
/* round up the final figure */
record->len = roundup(record->len);

/* and make sure the whole rest fits */
if (offset + record->len > db->map_size)
    goto badsize;
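roundup() here presumably rounds up to the next multiple of 8; the usual
bit-trick definition would be (my assumption of its behaviour):

```c
/* assumed definition: round n up to the next multiple of 8 */
#define ROUNDUP8(n) (((n) + 7UL) & ~7UL)
```

So ROUNDUP8(17) gives 24, and values already on an 8-byte boundary pass
through unchanged.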
You will notice that all this code reads fields into a structure rather
than handing out direct pointer accesses, and carefully checks at every
point that the record fits within the mmap. This is basically a
copy-and-paste of read_record() from my current dev tree.
So far, this is space-wise equivalent to the old format on average. The
old format was:
chars    field
4        TYPE
4        KEYLEN
%4       KEY
4        VALLEN
%4       VAL
4*level  PTR
4        -1 (pointer list finished)
Now PROB is 0.5, so the "average" record will have two pointers, so that
is a total overhead of 6*4 == 24 bytes per record, plus two 4 byte
roundings.
The new format so far is:
8        HEADER
8*level  PTR
%8       KEY+VAL
With two ptrs, that is 3*8 == 24 bytes! plus a single 8 byte rounding.
So the average size is the same. I am planning to add another 8 bytes
however, as two CRC32s: one over the HEADER and PTR part, and a
separate one over the key and value. This is so that the first CRC32
can be re-calculated on every rewrite, and the second only when the
record is created. So the final format would be:
8        HEADER
8        VALLENEXT?
8        KEYLENEXT?
8        DELPTR?
8*level  NEXTPTR
4        CRC32_HEAD
4        CRC32_VAL
%8       key+val
Or if people decide we need two roundings, then we pad the keylen out as
well, for an average extra cost of 4 bytes per record.
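The two-CRC split could be computed along these lines. The function
names and the plain bitwise CRC-32 below are mine (a real implementation
would use zlib's table-driven crc32(), which uses the same polynomial):

```c
#include <stddef.h>
#include <stdint.h>

/* plain bitwise CRC-32, reflected, polynomial 0xEDB88320 */
static uint32_t crc32_calc(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xffffffffu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xedb88320u & (uint32_t)-(int)(crc & 1));
    }
    return ~crc;
}

/* One checksum over HEADER..NEXTPTR, recomputed whenever the skip
 * pointers are rewritten ... */
static uint32_t crc_head(const unsigned char *rec, size_t headlen)
{
    return crc32_calc(rec, headlen);
}

/* ... and one over key+value, computed once when the record is
 * created and never touched again. */
static uint32_t crc_keyval(const unsigned char *keyval, size_t len)
{
    return crc32_calc(keyval, len);
}
```

The payoff of the split is that pointer rewrites during normal operation
only churn the small head checksum, while the (potentially large)
key+value area keeps its original checksum for its whole life.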
Can anyone see any glaring stupidities in this?
Oh yeah - "dummy" is just a record like everything else, it's the first
one. The "level" counter for the database is of course now just the
"level" field in the dummy record.
And a commit record is, on disk, 8 bytes: \0\0\0\0\0\0\0\1 - unlikely
to occur randomly in data!
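That byte pattern falls straight out of the 64-bit record header:
vallen 0, keylen 0, level 0, and type COMMIT (== 1) in the last byte.
A quick sketch (helper name is mine):

```c
#include <string.h>

/* Build a commit record header: 32-bit vallen, 16-bit keylen,
 * 8-bit level, 8-bit type, all zero except the type byte. */
static void make_commit(unsigned char out[8])
{
    memset(out, 0, 8);   /* vallen = 0, keylen = 0, level = 0 */
    out[7] = 1;          /* type = COMMIT */
}
```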
Bron.