I started reading the spgist vacuum code last night, and didn't like it
at all.  I found a number of smaller issues, but it seems to me that
the design is just fundamentally wrong.  Unless I'm misunderstanding it,
the approach is to recursively traverse the tree in sort of the same way
that a full-index search would do, except that it continues to hold lock
on the immediate parent page of whichever page is currently being
visited, so that it can update the downlink if it has to delete the
first leaf tuple of a chain.  I've got several complaints about that:

1. Holding exclusive locks on upper pages while we visit
possibly-hundreds of leaf pages seems pretty bad from a concurrency
standpoint.  It doesn't hold locks all the way up to the root, so it's
not a complete loss, but even a first-level inner page can have a lot of
children.

2. I do not trust this code to visit every leaf page (which, if it
failed to do so and thereby missed deleting a heap reference, would
result in a silently corrupt index).  Unlike a regular index search,
which makes a list of lower-level targets while examining an inner tuple
just once, this code depends on the assumption that it can reacquire
lock on an upper page and then resychronize itself with whatever's been
done meanwhile to the inner tuple it was looking at.  I don't trust
that.  The mechanism that's being used relies on page LSN having been
changed if anything was changed on the page, which does not work in
unlogged tables.  Furthermore, if the inner tuple did change, it has to
rescan *everything* that the inner tuple leads to.  That's horrendously
inefficient, and it's not even clear that it would ever terminate in the
face of a constant stream of concurrent updates.

3. Even if it all worked, it would be slow as can be, because it
requires a whole lot of nonsequential accesses to the index.  For
instance, the same leaf page might be visited many times (once for
each leaf chain on the page), and not necessarily close together either.

Also, the code doesn't seem to perform any cleanup at all on inner
pages.  I am not expecting it to try to get rid of childless inner
tuples, but surely it ought to at least convert old redirects to dead
and get rid of dead tuples at the end of the page, much as for leaf
pages?

What I'm envisioning doing instead is making a single sequential scan
over the entire index.  On both leaf and inner pages we could clean
redirects and remove end dead tuples.  On leaf pages we'd also check for
tuples to be deleted.  There's no real difficulty in removing deleted
tuples as long as they are not at the heads of their chains.  (The code
would have to reverse-engineer which tuples are at the heads of their
chains, but that's easily done while scanning the page; we just make a
reverse-lookup map for the nextOffset pointers.)  If we have to delete a
tuple at the head of its chain, but there's still at least one live
tuple in its chain, we could reshuffle the tuples to bring another live
tuple to that same offset number, thereby not invalidating the parent
downlink.  The only hard part is what to do when we delete all members
of a chain: we can't reset the parent downlink because we do not know
where it is.  What I propose to do in that case is install a variant
form of dead tuple that just indicates "this chain is empty".  One way
to represent that would be a redirect tuple pointing nowhere, but I
think it would be cleaner to repurpose the isDead and isRedirect bits as
a two-bit field with four states.  We'd have LIVE, DEAD, REDIRECT, and
this fourth state for a dead tuple that is not recyclable because we
know there's a link to it somewhere.  A scan would ignore such a tuple.
An insertion that arrives at such a tuple could either replace it (if
there's enough room on the page) or convert it to a redirect (if not).

Last night I was thinking the fourth state could be named TOMBSTONE,
but maybe it'd be better to use DEAD for this case and rename the
existing "removable dead tuple" state to PLACEHOLDER, since that case
only exists to preserve the offset numbers of other tuples on the
page.

Comments?

                        regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to