On Sun, Jan 22, 2012 at 04:49, Riyad Kalla <[email protected]> wrote:
> In my ever-growing need to ask the most esoteric file-format questions
> possible, I am curious if anyone can point me in the direction of the
> answer here...
>
> Background
> -----------------
> Consider the _id and _seq B+ tree indices whose change-paths are appended
> to the end of the Couch data file after every update. Each of the nodes
> written has a basic structure of [key, value], where the "key" is the _id
> or _seq value being indexed, and the "value" is an offset pointer to the
> next node in the tree to seek to (or, in the case of a leaf, an offset
> directly into the data file where the updated record lives).
>
> General
> -----------
> Efficiently searching the keys stored in a node of a B+ tree after it has
> been paged in requires (let's assume) that the keys are in sorted order;
> then they can be binary-searched to find the next offset or block ID
> on disk to jump to.
>
> To be able to efficiently jump around key values like that, they must all
> be the same size. For example, let's say we store 133 keys per node
> (assume it is full). That means our first comparison after paging in that
> block should be at the 67th key. The only way I can jump to that position
> is to know that every key value is, say, 16 bytes and then jump forward
> 67 * 16 = 1072 bytes before decoding the key from raw bytes and comparing.
>
> Question
> -------------
> Given the massive potential for document _id values to vary in size, it
> seems impossible to have a hard-coded "key size" that Couch utilizes
> here. Even something sufficiently large like 32 or 64 bytes would still
> run into situations where a longer key is being indexed, or a *very
> short* key is being indexed and there are multiple orders of magnitude of
> wasted space in the index.
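[For illustration, the fixed-width scheme Riyad describes might be sketched like this in Python. The 16-byte key width and the function name are assumptions chosen to match his example, not anything CouchDB actually does:]

```python
KEY_SIZE = 16  # assumed fixed key width, per the example above


def search_fixed_width(block: bytes, target: bytes):
    """Binary-search a node of fixed-width keys without decoding them all.

    `block` is assumed to be a packed run of KEY_SIZE-byte keys in sorted
    order; because every key is the same size, key #mid can be located by
    pure arithmetic. Returns the key's index, or None if absent.
    """
    lo, hi = 0, len(block) // KEY_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        # Jump straight to key #mid: mid * KEY_SIZE bytes into the block.
        key = block[mid * KEY_SIZE:(mid + 1) * KEY_SIZE]
        if key < target:
            lo = mid + 1
        elif key > target:
            hi = mid
        else:
            return mid
    return None
```

With 133 zero-padded 16-byte keys packed into one block, the search touches only O(log n) of them, which is exactly the property that breaks once keys vary in length.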
>
> The few solutions I've found to variable-length keys online require
> sequential processing of the keys in a node, because the keys will be
> written like:
>
> [key length][key data][offset]
>
> This is certainly doable, but less than optimal.
>
> I am curious how CouchDB handles this?
>
> Thank you for any pointers.
>
> -R
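[The [key length][key data][offset] layout he mentions forces a linear walk; a minimal sketch, assuming a big-endian u16 length and u64 offset (field widths are illustrative):]

```python
import struct


def scan_var_keys(block: bytes, target: bytes):
    """Sequentially decode a node laid out as repeated
    [key length (u16)][key data][offset (u64)] records.

    Since each key's length is only discovered while reading it, there is
    no way to jump to the Nth key directly. Returns the offset stored
    with `target`, or None. Assumes keys are written in sorted order.
    """
    pos = 0
    while pos < len(block):
        (klen,) = struct.unpack_from(">H", block, pos)
        pos += 2
        key = block[pos:pos + klen]
        pos += klen
        (offset,) = struct.unpack_from(">Q", block, pos)
        pos += 8
        if key == target:
            return offset
        if key > target:  # sorted order: we have walked past it
            return None
    return None
```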
The entire node is read into memory and deserialized from the Erlang
term storage format into a list. The result is still a typical tree
search, in which CouchDB can scan an inner node's key pointers to
determine which child(ren), if any, might contain a particular key, but
within a particular inner node that scan is linear. See here:

https://github.com/apache/couchdb/blob/master/src/couchdb/couch_btree.erl#L660

-Randall
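[A rough Python analogue of what Randall describes, with pickle standing in for the Erlang external term format and a simplified (last_key, child_pointer) layout for inner-node entries; both are assumptions for illustration, not couch_btree's real structures:]

```python
import pickle  # stand-in for Erlang term_to_binary / binary_to_term


def find_child(node_bytes: bytes, target):
    """Deserialize a whole inner node into a list, then scan it linearly.

    No fixed key width is needed: the serialization layer recovers each
    (last_key, child_pointer) pair regardless of key length, and the
    in-node search is a linear scan for the first subtree whose largest
    key is >= target.
    """
    node = pickle.loads(node_bytes)  # entire node in memory as a list
    for last_key, pointer in node:   # linear scan, as in couch_btree.erl
        if target <= last_key:
            return pointer
    return None
```

This trades the O(log n) in-node jump for simplicity: once the node is a list of decoded terms, variable-length keys cost nothing extra, and for the small fan-outs typical of a B+ tree node the linear scan is cheap.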
