Re: [HACKERS] amcheck (B-Tree integrity checking tool)

Peter Geoghegan Thu, 17 Nov 2016 12:05:54 -0800

On Wed, Nov 16, 2016 at 1:03 PM, Andres Freund <and...@anarazel.de> wrote:
> I think the patch could use a pgindent run.


Okay.

> I'd really want a function that runs all check on a table.

Why not just use a custom SQL query? The docs have an example of one
such query, that verifies all user indexes, but there could be another
example.

>> +#define CHECK_SUPERUSER() { \
>> +             if (!superuser()) \
>> +                     ereport(ERROR, \
>> +                                     
>> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
>> +                                      (errmsg("must be superuser to use 
>> verification functions")))); }
>
> Why is this a macro?

This was just copied from pageinspect, TBH. What would you prefer?

>> +#ifdef NOT_USED
>> +#define CORRUPTION           PANIC
>> +#define CONCERN                      WARNING
>> +#define POSITION             NOTICE
>> +#else
>> +#define CORRUPTION           ERROR
>> +#define CONCERN                      DEBUG1
>> +#define POSITION             DEBUG2
>> +#endif
>
> Hm, if we want that - and it doesn't seem like a bad idea - I think we
> should be make it available without recompiling.

I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That
might be broken if it was allowed.

> Hm, I think it'd be, for the long term, better if we'd move the btree
> check code to amcheck_nbtree.c or such.

Agreed.

>> +Datum
>> +bt_index_parent_check(PG_FUNCTION_ARGS)
>> +{
>> +     Oid                     indrelid = PG_GETARG_OID(0);

> Theoretically we should recheck that the oids still match, theoretically
> the locking here allows the index->table mapping to change. It's not a
> huge window tho...

>> +     /* Check for active uses of the index in the current transaction */
>> +     CheckTableNotInUse(indrel, "bt_index_parent_check");
>
> Why do we actually care?

This was left-over from duplicating REINDEX code. Indeed, we have no
reason to care at all.

>> +     if (!rel->rd_index->indisready)
>> +             ereport(ERROR,
>> +                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> +                              errmsg("cannot check index \"%s\"",
>> +                                             RelationGetRelationName(rel)),
>> +                              errdetail("Index is not yet ready for 
>> insertions")));
>
> Why can't we check this? Seems valuable to allow checking after a
> partial CONCURRENTLY index build?

Okay.

> Why don't you use IndexIsReady et al? Not sure if the patch compiles
> against an earlier release (there weren't valid/ready/live dead fields,
> but instead it was compressed into fewer)?

This version is almost identical to the Github version of amcheck that
we've been using with 9.4+. Agreed on IndexIsReady().

>> +/*
>> + * bt_check_every_level()
>> + *
>> + * Main entry point for B-Tree SQL-callable functions.  Walks the B-Tree in
>> + * logical order, verifying invariants as it goes.
>> + *
>> + * It is the caller's responsibility to acquire appropriate heavyweight 
>> lock on
>> + * the index relation, and advise us if extra checks are safe when an
>> + * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent 
>> any
>> + * kind of physical modification to the index structure, including
>> + * modifications that VACUUM may make.
>> + */
>
> Hm, what kind of locking does the kill_prior_tuple stuff use?

The killing itself, or the setting of LP_DEAD bit? Ordinary index
scans could concurrently set LP_DEAD with only an AccessShareLock.
But, that's just metadata that amcheck doesn't care about. Anything
that needs to actually recyle space based on finding an LP_DEAD bit
set is a write operation, that is precluded from running when
bt_index_parent_check() is called by its ExclusiveLock.
_bt_vacuum_one_page() is only called by one nbtinsert.c code path,
when it looks like a page split might be needed.

>> +static void
>> +bt_check_every_level(Relation rel, bool exclusivelock)
>> +{
>
>> +     /*
>> +      * Certain deletion patterns can result in "skinny" B-Tree indexes, 
>> where
>> +      * the fast root and true root differ.
>> +      *
>> +      * Start from the true root, not the fast root, unlike conventional 
>> index
>> +      * scans.  This approach is more thorough, and removes the risk of
>> +      * following a stale fast root from the meta page.
>> +      */
>> +     if (metad->btm_fastroot != metad->btm_root)
>> +             ereport(CONCERN,
>> +                             (errcode(ERRCODE_DUPLICATE_OBJECT),
>> +                              errmsg("fast root mismatch in index %s",
>> +                                             RelationGetRelationName(rel)),
>> +                              errdetail_internal("Fast block %u (level %u) "
>> +                                                                     
>> "differs from true root "
>> +                                                                     "block 
>> %u (level %u).",
>> +                                                                     
>> metad->btm_fastroot, metad->btm_fastlevel,
>> +                                                                     
>> metad->btm_root, metad->btm_level)));
>
> Hm, isn't that a potentially spurious report?

Well, it's not an error, so no. I think that it would be better with a
lower elevel, though, to indicate the low severity.

>> +                             ereport(CONCERN,
>> +                                             
>> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +                                              errmsg("block %u of index 
>> \"%s\" ignored",
>> +                                                             current, 
>> RelationGetRelationName(state->rel))));
>
> Hm, why is this worth a WARNING? Isn't it perfectly normal to have a few
> of these?

It is, but you shouldn't get very many of them, since pages that are
ignorable don't have sibling links at the second phase of deletion. I
guess WARNING might be too strong here, too.

> "next level down's leftmost block number" - it'd be good to simplify
> that a bit.

Okay.

>> +/*
>> + * bt_target_page_check()
>
> Why "target"?

Conceptually, the verification scan vists from true root to leaf, left
to right, and makes each page it finds along the way the target
exactly once. It's represented as such in the state that we pass
around -- there is some metadata stashed about the target there (e.g.,
LSN is kept there for easy retrieval by possible ereport(ERROR)
calls). I think that that makes things quite a lot clearer,
particularly when we need to discuss checks that happen against the
sibling or child of target (conceptually, we're still checking the
target). The description of what happens and why that is okay is a lot
clearer because we constantly refer to the current target page, IMV --
that's always our frame of reference.

>> +             /*
>> +              *   ********************
>> +              *   * Page order check *
>> +              *   ********************
>
> Isn't that item order, rather than page order?

I don't know...it checks that the items on the page are in the right order.

> (Various minor niggles about style)

Okay.

> This paragraph is too complicated to follow in the noisy environment
> where we both currently are ;)

As I pointed out yesterday, that paragraph is 95%+ of what makes the
patch hard to understand. It's worth it, though, to detect problems
with things like transposed pages while holding only an
AccessShareLock.

> downlinks?

Downlinks are an accepted terminology for the index tuples in internal
pages, that point to child pages (rather than having TID pointer to
table).

> "parent page's rightmost page"?

I guess that should be "parent page's rightmost child". This is
covered in the nbtree README:

"""
To preserve consistency on the parent level, we cannot merge the key space
of a page into its right sibling unless the right sibling is a child of
the same parent --- otherwise, the parent's key space assignment changes
too, meaning we'd have to make bounding-key updates in its parent, and
perhaps all the way up the tree.  Since we can't possibly do that
atomically, we forbid this case.  That means that the rightmost child of a
parent node can't be deleted unless it's the only remaining child, in which
case we will delete the parent too (see below).
''""

> I like this. Some of the more complex pieces towards the end of the
> field need some attention, there's a fair amount of word-smithing
> needed, and I do think we want to make the structural changes outlined
> above, but besides these, imo fairly simple adaptions, I do think this
> is useful and not that far from being committable.

Cool. I'll try to get a revision out soon. I'm happy to do that much.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

Reply via email to