checksum verification code breaks backups in v16-

Robert Haas Tue, 03 Dec 2024 08:40:10 -0800

In 025584a168a4b3002e19350bb8db0ebf1fd10235, which shipped with v17, I
changed the way that a base backup decides whether files have
checksums. At the time, I thought this was harmless refactoring, but
it turns out that it was better than harmless. The old way can cause
pg_basebackup failures. To reproduce, I believe you just need:


1. A cluster created with v16 or earlier with checksums enabled (i.e.
initdb -k).
2. A file inside "global", "base", or a tablespace directory whose
name contains a period which is followed by something that atoi() is
unable to convert to a positive number.

Then pg_basebackup will fail like this:

pg_basebackup: error: could not get COPY data stream: ERROR:  invalid segment number 0 in file "pg_control.old"

One way to fix this is to back-port
025584a168a4b3002e19350bb8db0ebf1fd10235. Before that commit, we
assumed all files had checksums unless the filename looked like one of
the files that we know aren't supposed to be checksummed. After that
commit, we assume files don't have checksums unless the file format
looks like something we know should be checksummed. That inherently
avoids trying to checksum things we're not expecting to see, such as a
pg_control.old file.
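
To illustrate the "only checksum what looks checksummable" policy, here
is a rough stand-alone sketch. It is not the code from the commit.
looks_like_relation_file() is a hypothetical helper, and it assumes
that relation data files are named as a number, optionally followed by
a fork suffix (_fsm, _vm, _init) and then an optional .segment number:

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not PostgreSQL source: return true only if the
 * name has the assumed shape <digits>[_fsm|_vm|_init][.<digits>].
 * Anything else is treated as "no checksums expected".
 */
static bool
looks_like_relation_file(const char *name)
{
    static const char *const forks[] = {"fsm", "vm", "init"};
    const char *p = name;

    /* The name must start with at least one digit. */
    if (!isdigit((unsigned char) *p))
        return false;
    while (isdigit((unsigned char) *p))
        p++;

    /* Optional fork suffix. */
    if (*p == '_')
    {
        bool        matched = false;

        p++;
        for (size_t i = 0; i < sizeof(forks) / sizeof(forks[0]); i++)
        {
            size_t      len = strlen(forks[i]);

            if (strncmp(p, forks[i], len) == 0)
            {
                p += len;
                matched = true;
                break;
            }
        }
        if (!matched)
            return false;
    }

    /* Optional segment number: a dot followed by at least one digit. */
    if (*p == '.')
    {
        p++;
        if (!isdigit((unsigned char) *p))
            return false;
        while (isdigit((unsigned char) *p))
            p++;
    }

    /* Trailing junk means this is not a file we recognize. */
    return *p == '\0';
}
```

Under this sketch, "16384.1" and "16384_fsm" would be checksummed, while
"pg_control.old" would simply be skipped rather than reaching the
segment-number parsing at all.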

If we want a narrower and thus less-risky fix, we could consider just
adjusting this code here:

                segmentno = atoi(segmentpath + 1);
                if (segmentno == 0)
                    ereport(ERROR,
                            (errmsg("invalid segment number %d in file \"%s\"",
                                    segmentno, filename)));

I think we could just replace the ereport() with verify_checksum =
false (and update the comments). That would leave open the possibility
of trying to checksum some random file with a name like
this_file_should_not_be_here, which we'll interpret as segment 0
because the name does not contain a dot; but if the file were called
this.file.should.not.be.here, we'd not try to checksum it at all
because "here" is not an integer. That's a little random, but it
avoids taking a position on whether
025584a168a4b3002e19350bb8db0ebf1fd10235 is fully correct. It simply
takes the narrower position that we shouldn't fail the entire backup
because we're unable to perform a checksum calculation that in the
worst case would only produce a warning anyway.
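
As a rough illustration of the resulting behavior, here is a
stand-alone sketch of the narrower fix. segment_allows_checksum() is a
hypothetical function, not something in the tree, and it assumes that
the text after the first period in the name is what gets handed to
atoi():

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-alone model of the narrower fix, not PostgreSQL
 * source. Returns whether checksum verification would be attempted
 * and stores the parsed segment number in *segmentno.
 */
static bool
segment_allows_checksum(const char *filename, int *segmentno)
{
    /* Assumption: the segment suffix starts at the first period. */
    const char *segmentpath = strchr(filename, '.');

    *segmentno = 0;

    /* No period at all: the file is treated as segment 0. */
    if (segmentpath == NULL)
        return true;

    *segmentno = atoi(segmentpath + 1);

    /*
     * Previously this case raised an ERROR and failed the backup.
     * With the narrower fix we just skip checksum verification.
     */
    if (*segmentno == 0)
        return false;

    return true;
}
```

With this model, "16384.1" is verified as segment 1,
"this_file_should_not_be_here" is verified as segment 0, and both
"pg_control.old" and "this.file.should.not.be.here" are skipped.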

At the moment, I think I slightly prefer the narrower fix, but it
might be a little too narrow, since the resultant behavior as
described in the preceding paragraph is somewhat wonky and
nonsensical.

Opinions?

-- 
Robert Haas
EDB: http://www.enterprisedb.com
