Re: Online enabling of checksums

Andres Freund Fri, 06 Apr 2018 16:52:16 -0700

On 2018-04-07 01:27:13 +0200, Daniel Gustafsson wrote:
> > On 07 Apr 2018, at 01:13, Andres Freund <and...@anarazel.de> wrote:
> > 
> > On 2018-04-07 01:04:50 +0200, Daniel Gustafsson wrote:
> >>> I'm fairly certain that the bug here is a simple race condition in the
> >>> test (not the main code!):
> >> 
> >> I wonder if it may perhaps be a case of both?
> > 
> > See my other message about the atomic fallback bit.
> 
> Yep, my MUA pulled it down just as I had sent this.  Thanks for confirming my
> suspicion.


But note it fails because the code using it is WRONG. There's a reason
there's "unlocked" in the name. But even leaving that aside, it probably
*still* be wrong if it were locked.

It seems *extremely* dubious that we'll allow to re-enable the checksums
while a worker is still doing stuff for the old cycle in the
background.  Consider what happens if the checksum helper is currently
doing RequestCheckpoint() (something that can certainly take a *LONG*)
while. Another process disables checksums. Pages get written out
without checksums. Yet another process re-enables checksums. Helper
process does SetDataChecksumsOn(). Which succeeds because

        if (ControlFile->data_checksum_version != 
PG_DATA_CHECKSUM_INPROGRESS_VERSION)
        {
                LWLockRelease(ControlFileLock);
                elog(ERROR, "Checksums not in inprogress mode");
        }

succeeds.  Boom. Cluster with partially set checksums but marked as
valid.


Greetings,

Andres Freund

Re: Online enabling of checksums

Reply via email to