On 2018-04-07 01:27:13 +0200, Daniel Gustafsson wrote: > > On 07 Apr 2018, at 01:13, Andres Freund <and...@anarazel.de> wrote: > > > > On 2018-04-07 01:04:50 +0200, Daniel Gustafsson wrote: > >>> I'm fairly certain that the bug here is a simple race condition in the > >>> test (not the main code!): > >> > >> I wonder if it may perhaps be a case of both? > > > > See my other message about the atomic fallback bit. > > Yep, my MUA pulled it down just as I had sent this. Thanks for confirming my > suspicion.
But note it fails because the code using it is WRONG. There's a reason there's "unlocked" in the name. But even leaving that aside, it probably *still* be wrong if it were locked. It seems *extremely* dubious that we'll allow to re-enable the checksums while a worker is still doing stuff for the old cycle in the background. Consider what happens if the checksum helper is currently doing RequestCheckpoint() (something that can certainly take a *LONG*) while. Another process disables checksums. Pages get written out without checksums. Yet another process re-enables checksums. Helper process does SetDataChecksumsOn(). Which succeeds because if (ControlFile->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_VERSION) { LWLockRelease(ControlFileLock); elog(ERROR, "Checksums not in inprogress mode"); } succeeds. Boom. Cluster with partially set checksums but marked as valid. Greetings, Andres Freund