01.03.2016 03:09, Peter Geoghegan:
I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck. This patch is
proposed for 9.6, as an extension in contrib -- maybe we can still get
it in. This is the first time I've added any version of this to a
commitfest, although I've posted a couple of rough versions of this in
the past couple of years. The attached version has received a major
overhaul, and is primarily aimed at finding corruption in production
systems, although I think it will still have significant value for
debugging too. Maybe it can help with some of the B-Tree patches in
the final commitfest, for example. I also have some hope that it will
become a learning tool for people interested in how B-Tree indexes

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).

thank you for the contrib module. The patch itself looks great.
It was really useful to have such a tool while working on b-tree patches.


nbtree invariants that the tool verifies with just an AccessShareLock
on the relation are:

* That all items are in the correct, opclass order on each page.

* That the page "high key", if any, actually bounds the items on the page.

* That the last item on a page is less than or equal to the first item
on the next page (the page to its right). The idea here is that the
key space spans multiple pages, not just one page, so it make sense to
check the last item where we can.

With an ExclusiveLock + ShareLock, some addition invariants are verified:

* That child pages actually have their parent's downlink as a lower bound.

* Sane right links and left links at each level.

Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.


There are only 2 SQL callable functions in the extension, which are
very similar:

bt_index_check(index regclass)

bt_index_parent_check(index regclass)

The latter is more thorough than the former -- it performs all checks,
including those checks that I mentioned required an ExclusiveLock. So,
bt_index_check() only requires an AccessShareLock.
bt_index_parent_check() requires an ExclusiveLock on the index
relation, and a ShareLock on its heap relation, almost like REINDEX.
bt_index_parent_check() performs verification that is a superset of
the verification performed by bt_index_check() -- mostly, the extra
verification/work is that it verifies downlinks against child pages.

Both functions raise an error in the event of observing that an
invariant in a B-Tree was violated, such as items being out of order
on a page. I've written extensive documentation, which goes into
practical aspects of using amcheck effectively. It doesn't go into
significant detail about every invariant that is checked, but gives a
good idea of roughly what checks are performed.

I could almost justify only having one function with an argument about
the downlink/child verification, but that would be a significant
footgun given the varying locking requirements that such a function
would have.

I've done some testing. And I can say that both announced functions work well. BTW, if you know a good way to corrupt index (and do it reproducible) I'd be very glad to see it.

I created an index, then opened the file and changed the order of elements.
And bt_index_check() found the error successfully.

/* Ensure that the data is changed. */
SELECT * FROM bt_page_items('idx', 1);

 itemoffset | ctid  | itemlen | nulls | vars | data
          1 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          3 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
(3 rows)

/* Great, amcheck found an error. */
select bt_index_check('idx');
ERROR:  page order invariant violated for index "idx"
DETAIL: Lower index tid=(1,1) (points to heap tid=(0,3)) higher index tid=(1,2) (points to heap tid=(0,2)) page lsn=0/17800D0.


We never rely on something like holding on to a buffer pin as an
interlock for correctness (the vacuum interlock thing isn't generally
necessary, because we don't look at the heap at all). We simply pin +
BT_READ lock a buffer, copy it into local memory allocated by
palloc(), and then immediately release the buffer lock and drop the
pin. This is the same in all instances. There is never more than one
buffer lock or pin held at a time.

We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.

I'm not an expert, but I promise to take a closer look at locking.
I will send another email soon.

Trying to keep the tool as simple as possible, while still making it
do verification that is as useful as possible was my priority here,
not performance. Still, verification completes fairly quickly.
Certainly, it takes far less time than having to REINDEX the index,
and doesn't need too much memory. I think that in practice most
problems that can be detected by the B-Tree checker functions will be
detected with the lighter variant.

I do not see any performance issues.
I'm sure that if someone wants to check whether the index is corrupted, he certainly can wait a minute (.

But I have some concerns about compatibility with my patches.
I've tried to call bt_index_check() over my "including" patch [1] and caught a segfault.

LOG: server process (PID 31794) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: select bt_index_check('idx');

I do hope that my patch will be accepted in 9.6, so this conflict looks really bad. I think that error is caused by changes in pages layout. To save some space nonkey attributes are truncated when btree copies the indexed value into inner pages or into high key. You can look at index_reform_tuple() calls.

I wonder, whether you can add some check of catalog version to your module or this case requires more changes?

With second patch which implements compression [2], amcheck causes another error.
postgres=# insert into tbl select 1 from generate_series(0,5);
postgres=# SELECT * FROM bt_page_items('idx', 1);
 itemoffset |      ctid      | itemlen | nulls | vars | data

1 | (2147483664,6) | 56 | f | f | 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 02 00 00 00 00 00 03 00 00 00 00
00 04 00 00 00 00 00 05 00 00 00 00 00 06 00 00 00 00 00

postgres=# select bt_index_check('idx');
ERROR:  cannot check index "idx"
DETAIL:  index is not yet ready for insertions

But I'm sure that it's a problem of my patch. So I'll fix it and try again.

[1] https://commitfest.postgresql.org/9/433/
[2] https://commitfest.postgresql.org/9/494/

Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to