On Mon, Nov 11, 2013 at 3:03 AM, Heikki Linnakangas
> On 10.11.2013 01:47, Robert Haas wrote:
>> I think we've tried pretty hard to avoid algorithms where the maximum
>> number of lwlocks that must be held at one time is not a constant, and
>> I think we're in for a bad time of it if we start to deviate from that
>> principal. I'm not sure what to do about this problem, but I think
>> locking N levels of the tree at once, where N can be as large as the
>> tree is deep, is probably a bad plan, whether the number of locks
>> required is N or 3N.
> I think I found a solution that accomplishes that. It's actually not that
> Like we currently do, first climb up the tree to check that it's safe to
> delete, ie. the downlink in the first non-empty parent is not the rightmost
> entry. But when we reach the level where the parent is non-empty - I'll call
> that the "topmost" parent - we keep that page locked. The leaf page is kept
> locked while we climb.
> This is enough to plug the race condition. As long as we hold a lock on the
> topmost parent containing the downlink to the branch of pages we're about to
> delete, it cannot become the rightmost entry. And as long as we hold a lock
> on the leaf page, no new insertions can happen on any of the internal pages
> in the branch, as insertions to internal pages only happen when a child is
> split. However, the rest of the algorithm needs to be slightly modified, as
> we cannot re-lock the lower-level pages until we release the lock on the
> topmost page, to avoid deadlock.
> So at this point, we hold two locks: the leaf page, and the topmost parent
> containing the downlink to the branch we're deleting. Next, we remove the
> downlink from the topmost parent, and mark the leaf page as half-dead in one
> atomic operation. Also, a pointer to the highest internal page in the branch
> we're deleting - the one the removed downlink pointed to - is put on the
> leaf page. We can now release the locks.
> At this point, searches and inserts work fine. The leaf page has been marked
> as half-dead, so any insertions to the deleted page's keyspace will go to
> its right sibling. The downlink is to the top of the branch is gone, so even
> if the right sibling is split many times, any keys in the transferred
> keyspace that propagate to the higher levels won't be out-of-order.
> All that is left is to unlink the all the lingering pages in the branch
> we're deleting from their left and right siblings. This can be done at any
> later time, and if we error out or crash for some reason, next vacuum that
> comes along can finish the job. This is done one level at a time. Lock the
> leaf page, and the internal page the leaf page points to, and the internal
> page's left and right siblings (in the right order, not this order). Change
> the left and right sibling's right- and left-links, mark the internal page
> as deleted, and update the pointer in the leaf page to point to the child of
> the deleted internal page. Then recurse to the child, until we reach the
> leaf level.
> This has the nice extra property that we don't need the incomplete-action
> tracking in WAL recovery. I'd like to get rid of that anyway.
I am not very knowledgeable of this code, but at least with my limited
knowledge I can't spot any problems with that approach. The only
thing that seems a little unfortunate is that the next vacuum may need
to finish cleaning things up; I thought at one point you were trying
to get rid of the amvacuumcleanup stuff. But maybe I'm
misremembering, and anyway it still seems like a step forward.
> I'm not sure what to do about stable branches. This could be back-patched,
> with the caveat that this introduces new WAL record types so standbys need
> to be upgraded before the master. But given the lack of field reports in the
> ten years this race condition has existed, I'm not sure it's worth the
In the absence of at least one (and maybe several) reports from the
field, -1 for back-patching this. At this point, it seems like far
more people will be confused by the need to upgrade things in order
than will benefit from the fix.
The Enterprise PostgreSQL Company
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription: