On Mon, Jan 23, 2012 at 19:49, Paul Davis <[email protected]> wrote: > On Mon, Jan 23, 2012 at 9:28 PM, Riyad Kalla <[email protected]> wrote: >> <resending, Gmail is telling me this never sent...> >> >> >> Randall and Paul, I appreciate the explanation and links to the exact spots >> in code. This helped a lot. >> >> I've been reading up on alternative database impls; InnoDB, for example, >> caps the key length at some upper bound and then truncates the key and puts >> the remainder in an overflow table; naturally this requires 1 more random >> I/O every time you match a long key. At the extreme case (where every key >> doesn't fit and parts of it are in the overflow table) I think you end up >> with key-table idea you were toying with Paul, or something related. >> >> Paul, I feel your pain re: alternative approaches to this. If you can do >> away with RANGE queries, moving to MD5 or SHA hashes (or something even >> faster like MurmurHash) is a nice way to get consistently sized keys which >> gives you the advantage of paging in a block in raw bytes and only >> deserializing the 16 or 20-byte chunks in the positions of a standard >> binary-search until you find what you want instead of the whole node. >> >> I *really* want to find a way to make that work, but people seem to like >> RANGE queries. >> >> I asked this question on SO and a gentleman suggested a header at the >> beginning of the block that actually lists the offsets of all the keys in >> the node. For node/block sizes of 32KB of smaller (which seem reasonable), >> you could safely get away with a 16-bit int for storing the offsets; >> assuming a typical key size, I imagine you wouldn't have more than 100 or >> 200 keys in a node; assume 150, that is 300 bytes worth of offsets at the >> beginning of the block. So we spend 7% (assuming 4096kb block size) of the >> block providing offsets to help make key searching faster. If you were >> willing to cap blocks to no more than 255 keys you could use 8-bit offset >> values (meaning any key smaller than 8 bytes would get you in trouble). >> >> A limitation to this idea though is that when you need to overflow a single >> block because you are indexing so many or such large keys, you now end up >> with two blocks next to each other with 1 of them having no header. >> >> You can fix this by having a hard-set-block-size for the DB (say 4KB) that >> never changes and can never be configured. That way every block, >> everywhere, has a very tiny "type" header; maybe 8-bit. If the "normal" bit >> is set, then that block would have an offset list at the beginning of it >> before the key data. If the "overflow" bit was set, than that block is a >> continuation of the block before it, and right after the 8-bit block header >> is raw key data that needs to be considered. >> >> Just an idea, I am sure there are more elegant ways to solve this. I am >> more curious if there is something here... something interesting and worth >> considering, or if I am just enjoying the smell of my own farts over here ;) >> >> -R >> > > I would say that the important constraints in changing this is that > you can't change the existing behavior without a *very* good reason. > For instance, there's currently no limit on the key length in the > btree. Changing this (even to something "long" like 65535 bytes) is > still a behavior change. And removing range queries is pretty much out > without changing a major API endpoint. > > That said I invite you to play with this code. I've spent a good deal > of time trying to make it better but kept coming up short when > measuring performance (and its possible that my measurements weren't > completely representative either). If you come up with anything give > me a holler and I'd love to go through and see what you come up with.
Hear, hear! Play with the code!
