Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

Thomas Munro Mon, 24 Jul 2023 18:36:54 -0700

On Tue, Jul 25, 2023 at 8:18 AM Robert Haas <robertmh...@gmail.com> wrote:
> (Yeah, I know we have code to verify checksums during a base
> backup, but as discussed elsewhere, it doesn't work.)


BTW the code you are referring to there seems to think 4KB
page-halves are atomic; not sure if that's imagining page-level
locking in ancient Linux (?), or imagining default setvbuf() buffer
size observed with some specific implementation of fread(), or
confusing power-failure-sector-based atomicity with concurrent access
atomicity, or something else, but for the record what we actually see
in this scenario on ext4 is the old/new page contents mashed together
on much smaller boundaries (maybe cache lines), caused by duelling
concurrent memcpy() to/from, independent of any buffer/page-level
implementation details we might have been thinking of with that code.
Makes me wonder if it's even technically sound to examine the LSN.
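
To make the tearing pattern concrete, here is a minimal Python sketch (not
PostgreSQL code) that compares an observed page image against known old and
new images and reports where the contents flip between the two. The helper
name torn_runs, the 64-byte grain (a guess at cache-line size), and the
synthetic page contents are all assumptions made up for illustration.

```python
PAGE_SIZE = 8192  # PostgreSQL's default block size


def torn_runs(old: bytes, new: bytes, seen: bytes, grain: int = 64):
    """Label each grain-sized chunk of `seen` and merge equal neighbours.

    Labels: 'same'  chunk is identical in old and new (tells us nothing),
            'old'   chunk matches the old image only,
            'new'   chunk matches the new image only,
            'mixed' chunk matches neither, so the tear is finer than `grain`.
    Returns a list of (offset, length, label) runs.
    """
    assert len(old) == len(new) == len(seen)
    runs = []
    for off in range(0, len(seen), grain):
        o = old[off:off + grain]
        n = new[off:off + grain]
        s = seen[off:off + grain]
        if o == n and s == o:
            label = "same"
        elif s == o:
            label = "old"
        elif s == n:
            label = "new"
        else:
            label = "mixed"
        if runs and runs[-1][2] == label:
            start, length, _ = runs[-1]
            runs[-1] = (start, length + len(s), label)
        else:
            runs.append((off, len(s), label))
    return runs


# Synthetic images (assumption): the old page is all 0x00, the new all 0xff,
# and the reader saw a blend whose seams are not on 4KB boundaries.
old = b"\x00" * PAGE_SIZE
new = b"\xff" * PAGE_SIZE
seen = new[:192] + old[192:4160] + new[4160:]

runs = torn_runs(old, new, seen)
for off, length, label in runs:
    print(f"{off:5d} +{length:5d} {label}")
```

With these synthetic inputs the flips land at offsets 192 and 4160, neither
of which is a multiple of 4096, so a check that treats each 4KB half as
all-old or all-new would misjudge this page.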

> It's also why we
> have to force full-page write on during a backup. But the whole thing
> is nasty because you can't really verify anything about the backup you
> just took. It may be full of gibberish blocks but don't worry because,
> if all goes well, recovery will fix it. But you won't really know
> whether recovery actually does fix it. You just kind of have to cross
> your fingers and hope.

Well, not without also scanning the WAL for FPIs, anyway...  And
conceptually, that's why I think we probably want an 'FPI' of the
control file somewhere.
