2013/11/2 Simon Riggs <si...@2ndquadrant.com>:
> On 29 October 2013 16:10, Peter Geoghegan <p...@heroku.com> wrote:
>
>> Presumably someone will get around to implementing a btree index
>> insertion buffer one day. I think that would be a particularly
>> compelling optimization for us, because we could avoid ever inserting
>> index tuples that are already dead when the deferred insertion
>> actually occurs.
>
> That's pretty much what the LSM-tree is.


[ Disclaimer: I have only skimmed the LSM-trees paper rather than read it
thoroughly. ]

LSM-trees seem to hit a wall when the total amount of data gets big enough,
unless one uses “multi-component LSM-trees” (as described in the paper).
Having a B-tree with deferred insertion would be similar to having an
LSM-tree without the multi-component property.

The underlying problem with fast random insertions into a B-tree is that,
once the tree gets large enough relative to memory, each write of a whole
block contains only a small amount of “useful changes”. The
“multi-component” concept fixes that. I think that simply combining that
concept with normal B-trees is a rather elegant way of (at least
theoretically) solving the problem.

Definitions:

* Define a rather small integer K (with K ≥ 2) that will influence maximum
  insertion speed (higher K = higher insertion speed), but also look-up
  time (higher K = slower look-up).
* Define some size S “that easily fits in memory.”
* F is the fan-out of the B-tree. (If F < K, the algorithm gives slower
  amortized insertion speed than simply using one single big B-tree, if
  only the number of blocks read/written is taken into account; it may
  still be better because of seek times.)
* n is the total number of entries in the index.

Algorithm:

* Insert into a small B-tree until it reaches size S, then start inserting
  into a new B-tree until that one fills up to size S, etc.
* When K such B-trees (of size S) exist, merge them together into one
  (S × K)-sized B-tree and delete the old ones.
* Do this recursively: once K B-trees of size (S × K) exist, merge them
  together into one (S × K^2)-sized B-tree, etc.
* Let each look-up walk all trees, and return the union of all results.

(Note that K B-trees can be merged by simply scanning all of them
concurrently, and merging them just like a merge sort merges runs.
Also, all B-trees except those at the first level (of size S) can be
compacted 100%, as there is no need to reserve space for further
insertions in them.)

Insertion speed can be calculated as follows. (I disregard seek times to
make this easy; that also happens to be the right assumption for SSDs.)
Assume that a small B-tree (size S) can simply be written out without
having to write each block multiple times. Each entry then has to be
written log_K(n × log_F(n)) times, and all blocks written are 100% “useful
changes.” Therefore, insertion is log_K(n × log_F(n)) times slower than
raw disk speed. (Note that I also disregard the time needed for reading;
this may make everything about twice as slow.)

Example: for K = 10, F = 100 (i.e., 80 B per entry), blocksize = 8 kB, and
n = 10^9 (i.e., 80 GB of entries), insertion is
log_10(10^9 × log_100(10^9)) = log_10(10^9 × 4.5) ≈ 9.65 times slower than
writing the 80 GB of index entries sequentially. In the “one single big
B-tree” scenario, insertion would be ~F = ~100 times slower than raw disk
speed (assuming that all writes except those to the leaf level can be
combined).

Look-up speed is as follows. Each look-up must look through all B-trees.
Assume for simplicity that all trees have the same height as the single
B-tree in the “one single big B-tree” case (this is an overestimate, since
most trees are less tall, but it seems good enough as an estimate), that
there are K trees of each size (idem), and that there are log_K(n)
different sizes. Then the number of trees to walk is K × log_K(n), and
each look-up is therefore K × log_K(n) times slower than in the “one
single big B-tree” case.

Example (using the same parameters as above): look-ups are
10 × log_10(10^9) = 90 times slower, because there are ~90 trees.

Index size: I think (I didn’t calculate it) that the combined size of the
B-trees will be about the same as, or a little more than, the size of a
single big B-tree containing the same entries.
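The merge scheme above can be sketched in a few lines. This is
illustrative Python, not PostgreSQL code: sorted lists stand in for
B-trees, `heapq.merge` plays the role of the concurrent merge-sort-style
scan, and the names `BForest`, `S`, and `K` are my own.

```python
import heapq
from bisect import bisect_left

K = 3  # merge fan-in: merge K same-sized trees into one bigger tree
S = 4  # capacity of the small in-memory tree (the "size S" above)

class BForest:
    def __init__(self):
        self.memtree = []  # current small tree being filled
        self.levels = {}   # level -> list of sorted runs of size S * K^level

    def insert(self, key):
        self.memtree.append(key)
        if len(self.memtree) >= S:
            # The small tree is full: write it out as a sorted run.
            run = sorted(self.memtree)
            self.memtree = []
            self._add_run(0, run)

    def _add_run(self, level, run):
        runs = self.levels.setdefault(level, [])
        runs.append(run)
        if len(runs) >= K:
            # K runs of this size exist: scan them concurrently and merge
            # them (as a merge sort merges runs), deleting the old ones.
            merged = list(heapq.merge(*runs))
            self.levels[level] = []
            self._add_run(level + 1, merged)

    def lookup(self, key):
        # A look-up must walk every tree in the forest.
        trees = [sorted(self.memtree)]
        for runs in self.levels.values():
            trees.extend(runs)
        for tree in trees:
            i = bisect_left(tree, key)
            if i < len(tree) and tree[i] == key:
                return True
        return False

forest = BForest()
for key in range(0, 200, 2):  # insert 100 even keys
    forest.insert(key)
```

After these inserts the forest holds several trees of different size
classes, and a look-up searches all of them.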
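The back-of-envelope arithmetic in the two examples can be checked
directly (a toy calculation under the same simplifying assumptions: seeks
and reads are ignored):

```python
from math import log

K, F, n = 10, 100, 10**9  # parameters from the examples above

# Insertion: each entry is written log_K(n * log_F(n)) times.
insert_slowdown = log(n * log(n, F), K)

# Look-up: K trees of each of log_K(n) size classes must be searched.
lookup_slowdown = K * log(n, K)

print(round(insert_slowdown, 2), round(lookup_slowdown))  # prints: 9.65 90
```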
I have no idea whether someone already gave this kind of “forest of
B-trees” structure a name. Otherwise, I suggest “B-forest” :-).

In conclusion, use a “B-forest” when:

* The index entries are small (large fan-out).
* The insertion throughput is high.
* It’s OK for look-ups to be slow.
* Extra points when the storage medium has high seek times.

Major missing piece in PostgreSQL (I think):

* Functionality to merge K indexes into one (bigger) combined index.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers