On Wed, Aug 3, 2016 at 1:31 AM, Kyotaro HORIGUCHI
<horiguchi.kyot...@lab.ntt.co.jp> wrote:
> The first line is emitted for simultaneous deletion of a index
> page, which is impossible by design in a consistent state so the
> complained situation should be the result of an index corruption
> before upgading, specifically, inconsistent sibling pointers
> around a deleted page.

> My point here is that if concurrent deletion can't be perfomed by
> the current implement, this while loop could be removed and
> immediately error out or log a message,

Perhaps I have misunderstood, but I must ask: Are you aware that 9.4
has commit efada2b8, which fixes a bug with page deletion, that never
made it into earlier branches?

It is worth noting that anything that can make VACUUM in particular
raise an error is considered problematic. For example, we do not allow
VACUUM to finish incomplete B-Tree page splits, even though we could,
because there might be no disk space for even one extra page at the
worst possible time (we VACUUM in part to free disk space, of course).
We really want to let VACUUM finish, if at all possible, even if it
sees something that indicates data corruption.

I remember that during the review of the similar B-Tree page split
patch (not efada2b8, but rather commit 40dae7ec, which was also in
9.4), I found bugs due to functions not being very clear about whether
a buffer lock and/or pin were held upon returning from a function in
all cases, including very rare cases or "can't happen" cases. So, this
seems kind of familiar.

Code comments in the relevant function say this:

 * Returns 'false' if the page could not be unlinked (shouldn't happen).
 * If the (new) right sibling of the page is empty, *rightsib_empty is set
 * to true.
 */
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
{

What we see here is that this "shouldn't happen" condition does, in
fact, happen very rarely, including in cases where the target is not a
leaf page.

I think there might well be a real bug here, because we are not
careful enough in ensuring that the _bt_unlink_halfdead_page()
"contract" holds when something that "can't happen" does, in fact,
happen:

    /*
     * Release the target, if it was not the leaf block.  The leaf is always
     * kept locked.
     */
    if (target != leafblkno)
        _bt_relbuf(rel, buf);

    return true;
}

(Assuming you can call this code at the end of
_bt_unlink_halfdead_page() its "contract".)

It's not good at all if the code is stuck in a way that prevents
VACUUM from finishing, as I said. That seems to be the case here,
judging from Horiguchi-san's log output, which shows an ERROR from
"lock main 13879 is not held". The problem is not so much that a
"can't happen" thing happens (data corruption will sometimes make "the
impossible" possible, after all -- we know this, and write code
defensively because of this). The bug is that there is any ERROR for
VACUUM that isn't absolutely unavoidable (the damage has been done
anyway -- the index is already corrupt).

I'll need to think about this some more, when I have more time.
Perhaps tomorrow.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to