On Sat, Jan 18, 2014 at 11:45 AM, Kevin Grittner <kgri...@ymail.com> wrote: > Heikki Linnakangas <hlinnakan...@vmware.com> wrote: > >> Here's a patch implementing this scheme.
I thought I'd weigh in here too, since this is closely tied to the page split patch, which is dependent on this patch. > The basic approach is sound. The papers on which we based our > btree implementation did not discuss entry deletion, and our > current approach introduces new possible states into the on-disk > index structure which are not covered by the papers and are not > always handled correctly by the current code. Lehman and Yao don't discuss deletion. But V. Lanin and D. Shasha. do, in the paper "A Symmetric Concurrent B-Tree Algorithm"[1], as the README notes - we currently follow a "simplified" version of that. Apparently a number of people proposed solutions to address the omission of L&Y, as one survey of those approaches notes [2]. One of the approaches appearing in that survey is L&S. The classic L&Y algorithm only has right-links, and not left-links. The same applies to L&S. But I'd describe this patch as bringing what we do closer to L&S, which is more or less a refinement of B-Link trees (that is, L&Y B-Trees). L&S does have a concept of an outlink, crucial to their "symmetric deletion approach", which is something that we don't need, because we already have left-links (we have them primarily to support backwards index scans). L&S says "In general, the link technique calls for processes that change sections of a data structure in a way that would cause other processes to fail to provide a link to another section of the data structure where recovery is possible". So loosely speaking we're doing a "conceptual _bt_moveleft()" for deletion as the inverse of a literal _bt_moveright for insertion. L&S says "a deletion needs no more than two write locks at a time during its ascent" (the descent is just a garden variety _bt_search() insertion scankey descent). However, we currently may obtain as many as 4 write locks for "page deletion". As the README notes: """ To delete an empty page, we acquire write lock on its left sibling (if any), the target page itself, the right sibling (there must be one), and the parent page, in that order. """ We obtain 2 write locks in the first phase (on target and parent). In the second phase, we do what is described above: lock 4 pages in that order. However, the reason for this difference from the L&S paper is made clear in "2.5 Freeing Empty nodes", which kind of cops out of addressing how to *unlink* empty/half-dead pages, with a couple of brief descriptions of approaches that others have come up with that are not at all applicable to us. For that reason, might I suggest that you change this in the README: + Page Deletion + ------------- I suggest that it should read "Page Deletion and Unlinking" to emphasize the distinction. > The general feeling is that this patch is not suitable for > back-patching. I agree with this, but this is a bug which could > lead to index corruption which exists in all supported branches. > If nobody can suggest an alternative which we can back-patch, we > might want to reconsider this after this has been in production > releases for several months. That was what I thought. Let's revisit the question of back-patching this when 9.4 has been released for 6 months or so. > Other than that, I have not found any problems. Incidentally, during my research of these techniques, I discovered that Johnson and Shasha argue that "for most concurrent B-tree applications, merge-at-empty produces significantly lower restructuring rates, and only a slightly lower space utilization, than merge-at-half". This is covered in the paper "B-trees with inserts, deletes, and modifies: Why Free-at-empty is Better than Merge-at-half" [3]. I was pleased to learn that there was a sound, well regarded theoretical basis for that behavior, even if the worst-case bloat can be unpleasant. Though having said that, that paper doesn't weigh what we do in the event of non-HOT updates, and that could change the equation. That isn't an argument against merge-at-empty; it's an argument against the current handling of non-HOT updates. Currently, the comments above _bt_doinsert() say at one point: * This routine is called by the public interface routines, btbuild * and btinsert. This is obsolete; since commit 89bda95d, only the insert public interface routine calls _bt_doinsert(). Perhaps fix this in passing. [1] L&S: https://archive.org/stream/symmetricconcurr00lani/symmetricconcurr00lani_djvu.txt [2] http://www.dtic.mil/get-tr-doc/pdf?AD=ADA232287&Location=U2&doc=GetTRDoc.pdf (Chapter 3, "The Coherent Shared Memory Algorithm") [3] http://cs.nyu.edu/shasha/papers/bjournal.ps -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers