On 09.11.2013 18:24, Tom Lane wrote:
Heikki Linnakangas <hlinnakan...@vmware.com> writes:
2. The second-simplest solution I see is to keep locked the whole chain
of pages that will be deleted, and delete all of them as one atomic
WAL-logged operation. Ie. the leaf page, and all the parent pages above
it that will become half-dead, and the parent of the last half-dead page.
This would be more tenable if we could put a known limit on the number of
pages to be changed at once. I'm not too awake at the moment, so maybe
this is not possible, but could we simply decide in advance that we won't
propagate page deletion up more than a-small-number of levels? It'd
require allowing a page to remain half-dead until some other vacuum comes
along and fixes it, but that seems OK.
I don't feel comfortable leaving pages in half-dead state. Looking back
at the archives, your original design ten years ago did exactly that
but that turned out to be a bad idea
Mind you, even if we check that the half-dead page at some upper level
is eventually deletable because it's not the rightmost child, it might
become the rightmost child if we don't hold the lock so that the next
vacuum cannot delete it, and we're back to square one.
We could just punt if more than X pages would need to be changed. That
would mean that we never delete pages at the top (h - X) levels of the
tree. In practice that should be fine if X is high enough.
As a data point, GIN list page deletion holds 16 pages locked at once
(GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge.
As another data point, in the worst case the keys are so wide that only
2 keys fit on each B-tree page. With that assumption, the tree can be at
most 32 levels deep if you just insert into it with no deletions, since
MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not
sure). Deletions make it more complicated, but I would be pretty
surprised if you can construct a B-tree tallers than, say 40 levels.
Things gets tricky if shared_buffers is very small; with
shared_buffers=16, you certainly can't hold more than 16 buffers pinned
at once. (in fact, I think ginfast.c already has a problem with that...)
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription: