[HACKERS] Re: heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))

Peter Geoghegan Wed, 18 Oct 2017 12:47:15 -0700

On Mon, Oct 16, 2017 at 8:09 PM, Noah Misch <n...@leadboat.com> wrote:
> That presupposes construction of two pieces of software, the server and the
> checker, such that every disagreement is a bug in the server.  But checkers
> get bugs just like servers get bugs.


You make a good point, which is that *some* code must be wrong when an
error is raised and hardware is not to blame, but ISTM that the nuance
of that really matters. The checker seems much less likely to be where
bugs are, for three reasons:

* There is far less code for us to maintain as compared to the volume
of backend code that is effectively tested (again, not including the
hidden universe of complex, unauditable firmware code that could be
involved these days).

* Much of the actual checking (as much as possible) is outsourced to
core code that is already critically important. If that has bugs in
it, then it is unlikely to be defined as an amcheck bug.

* Knowing all this, we can go out of our way to do a good job of
getting the design right the first time. (A sound design is far more
important than actually having zero bugs.)

Obviously there could be unambiguous bugs; I'm not arguing otherwise.
I just hope that we can push this model as far as we need to, and
perhaps accommodate verifiability as a goal for *future* development
projects. We're almost doing that today; debuggability of on-disk
structures is something that the community already values. This is the
logical next step, IMV.

> Checkers do provide a sort of
> double-entry bookkeeping.  When a reproducible test case prompts a checker
> complaint, we'll know *some* code is wrong.

I really like your double-entry bookkeeping analogy. A tiny
discrepancy will bubble up, even in a huge organization, and yet the
underlying principles are broad and not all that complicated.

> That's an admirable contribution.

Thank you. I just hope that it becomes something that other
contributors have some sense of ownership over.

> I'm essentially saying that the server is innocent until proven guilty.  It
> would be cool to have a self-contained specification of PostgreSQL data files,
> but where the server disagrees with the spec without causing problem
> behaviors, we'd ultimately update the spec to fit the server.

I might not have done a good job of explaining my position. I agree
with everything you say here. I would like to see amcheck become a
kind of vehicle for discussing things that we already discuss. You get
a nice tool at the end, one that clarifies and increases confidence in the
original understanding over time (or acts as a canary-in-the-coalmine
forcing function when the original understanding turns out to be
faulty). The tool itself is ultimately just a bonus.

Bringing it back to the concrete freeze-the-dead issue, and the
question of an XID-cutoff for safely interrogating CLOG: I guess it
will be possible to assess a HOT chain as a whole. We can probably do
this safely while holding a super-exclusive lock on the buffer. I can
probably find a way to ensure this only needs to happen in a rare slow
path, when it looks like the invariant might be violated but we need
to make sure (I'm already following this pattern in a couple of
places). Realistically, there will be some amount of "try it and see"
here.
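
To make the "rare slow path" idea concrete, here is a toy sketch of the
control flow only. It is not amcheck code and does not use any PostgreSQL
API. Every name in it (Page, chain_ok, verify_chain, the stats counters) is
hypothetical, XIDs are modelled as plain integers with no wraparound, and a
threading.Lock merely stands in for the stronger buffer lock. The invariant
itself (no XID in the chain precedes the cutoff) is also just an assumption
chosen for illustration.

```python
import threading


class Page:
    """Toy stand-in for a heap buffer; not a PostgreSQL structure."""

    def __init__(self, xids):
        self.xids = list(xids)        # XIDs of the tuples in one chain (hypothetical)
        self.lock = threading.Lock()  # stands in for the stronger buffer lock


def chain_ok(xids, cutoff):
    """Assumed invariant: no XID in the chain precedes the cutoff."""
    return all(x >= cutoff for x in xids)


def verify_chain(page, cutoff, stats):
    """Check one chain, taking the expensive lock only when something looks wrong."""
    # Fast path: cheap check on a copy taken without the lock.
    if chain_ok(list(page.xids), cutoff):
        stats["fast"] += 1
        return True

    # Slow path: the invariant *might* be violated, so recheck under the
    # lock before reporting anything.
    with page.lock:
        stats["slow"] += 1
        return chain_ok(page.xids, cutoff)


if __name__ == "__main__":
    stats = {"fast": 0, "slow": 0}
    print(verify_chain(Page([100, 101]), 50, stats))  # healthy chain, fast path only
    print(verify_chain(Page([10, 101]), 50, stats))   # suspicious chain, rechecked under lock
    print(stats)
```

The point of the shape is that the common, healthy case never pays for the
heavy lock; only a suspected violation does, and a complaint is raised only
after the recheck confirms it.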

-- 
Peter Geoghegan

