On Sun, Jan 22, 2012 at 04:49, Riyad Kalla <[email protected]> wrote:
> In my ever-growing need to ask the most esoteric file-format questions
> possible, I am curious if anyone can point me in the direction of the
> answer here...
>
> Background
> -----------------
> Consider the _id and _seq B+ tree indices whose change-paths are appended
> to the end of the Couch data file after every update. Each of the nodes
> written has a basic structure of [key, value], where the "key" is the _id
> or _seq value being indexed, and the "value" is an offset pointer to the
> next node in the tree to seek to (or, in the case of a leaf, an offset
> directly into the data file where the updated record lives).
>
> General
> -----------
> Efficiently searching the keys stored in a node of a B+ tree after it has
> been paged in requires (let's assume) that the keys are in sorted order;
> then they can be binary-searched to find the next offset or block ID
> on disk to jump to.
>
> To be able to efficiently jump around key values like that, they must all
> be the same size. For example, let's say we store 133 keys per node
> (assume it is full). That means our first comparison after paging in that
> block should be at the 67th key. The only way I can jump to that position
> is to know that every key value is, say, 16 bytes and then jump forward
> 67 * 16 = 1072 bytes before decoding the key from raw bytes and comparing.
>
> Question
> -------------
> Given the massive potential for document _id values to vary in size, it
> seems impossible to have a hard-coded "key size" that Couch utilizes
> here. Even something sufficiently large like 32 or 64 bytes would still
> run into situations where a longer key is being indexed, or a *very
> short* key is being indexed and there are multiple orders of magnitude of
> wasted space in the index.
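[For illustration, the fixed-width scheme Riyad describes might be sketched like this in Python. The 16-byte key width and the function name are assumptions chosen to match his example, not anything CouchDB actually does:]

```python
KEY_SIZE = 16  # assumed fixed key width, per the example above


def search_fixed_width(block: bytes, target: bytes):
    """Binary-search a node of fixed-width keys without decoding them all.

    `block` is assumed to be a packed run of KEY_SIZE-byte keys in sorted
    order; because every key is the same size, key #mid can be located by
    pure arithmetic. Returns the key's index, or None if absent.
    """
    lo, hi = 0, len(block) // KEY_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        # Jump straight to key #mid: mid * KEY_SIZE bytes into the block.
        key = block[mid * KEY_SIZE:(mid + 1) * KEY_SIZE]
        if key < target:
            lo = mid + 1
        elif key > target:
            hi = mid
        else:
            return mid
    return None
```

With 133 zero-padded 16-byte keys packed into one block, the search touches only O(log n) of them, which is exactly the property that breaks once keys vary in length.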
>
> The few solutions I've found to variable-length keys online require
> sequential processing of the keys in a node, because the keys will be
> written like:
>
> [key length][key data][offset]
>
> This is certainly doable, but less than optimal.
>
> I am curious how CouchDB handles this?
>
> Thank you for any pointers.
>
> -R
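[The [key length][key data][offset] layout he mentions forces a linear walk; a minimal sketch, assuming a big-endian u16 length and u64 offset (field widths are illustrative):]

```python
import struct


def scan_var_keys(block: bytes, target: bytes):
    """Sequentially decode a node laid out as repeated
    [key length (u16)][key data][offset (u64)] records.

    Since each key's length is only discovered while reading it, there is
    no way to jump to the Nth key directly. Returns the offset stored
    with `target`, or None. Assumes keys are written in sorted order.
    """
    pos = 0
    while pos < len(block):
        (klen,) = struct.unpack_from(">H", block, pos)
        pos += 2
        key = block[pos:pos + klen]
        pos += klen
        (offset,) = struct.unpack_from(">Q", block, pos)
        pos += 8
        if key == target:
            return offset
        if key > target:  # sorted order: we have walked past it
            return None
    return None
```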
The entire node is read into memory and deserialized from the Erlang
term storage format into a list. The result is still a typical tree
search, in which CouchDB can scan an inner node's key pointers to
determine which child(ren), if any, might contain a particular key, but
within a particular inner node that scan is linear. See here:

https://github.com/apache/couchdb/blob/master/src/couchdb/couch_btree.erl#L660

-Randall
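[A rough Python analogue of what Randall describes, with pickle standing in for the Erlang external term format and a simplified (last_key, child_pointer) layout for inner-node entries; both are assumptions for illustration, not couch_btree's real structures:]

```python
import pickle  # stand-in for Erlang term_to_binary / binary_to_term


def find_child(node_bytes: bytes, target):
    """Deserialize a whole inner node into a list, then scan it linearly.

    No fixed key width is needed: the serialization layer recovers each
    (last_key, child_pointer) pair regardless of key length, and the
    in-node search is a linear scan for the first subtree whose largest
    key is >= target.
    """
    node = pickle.loads(node_bytes)  # entire node in memory as a list
    for last_key, pointer in node:   # linear scan, as in couch_btree.erl
        if target <= last_key:
            return pointer
    return None
```

This trades the O(log n) in-node jump for simplicity: once the node is a list of decoded terms, variable-length keys cost nothing extra, and for the small fan-outs typical of a B+ tree node the linear scan is cheap.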
