Re: pg_verifybackup: TAR format backup verification

Robert Haas Fri, 23 Aug 2024 13:32:30 -0700

On Wed, Aug 21, 2024 at 7:08 AM Amul Sul <[email protected]> wrote:
> I have reworked a few comments, revised error messages, and made some
> minor tweaks in the attached version.


Thanks.

> Additionally, I would like to discuss a couple of concerns regarding
> error placement and function names to gather your suggestions.
>
> 0007 patch: Regarding error placement:
>
> 1. I'm a bit unsure about the (bytes_read != m->size) check that I
> placed in verify_checksum() and whether it's in the right place. Per
> our previous discussion, this check is applicable to plain backup
> files since they can change while being read, but not for files
> belonging to tar backups. For consistency, I included the check for
> tar backups as well, as it doesn't cause any harm. Is it okay to keep
> this check in verify_checksum(), or should I move it back to
> verify_file_checksum() and apply it only to the plain backup format?

I think it's a good sanity check. For a long time I thought it was
triggerable until I eventually realized that you just get this message
if the file size is wrong:

pg_verifybackup: error: "pg_xact/0000" has size 8203 on disk but size
8192 in the manifest

After realizing that, I agree with you that this doesn't really seem
reachable for tar backups, but I don't think it hurts anything either.

While I was investigating this question, I discovered this problem:

$ pg_basebackup -cfast -Ft -Dx
$ pg_verifybackup -n x
backup successfully verified
$ mkdir x/tmpdir
$ tar -C x/tmpdir -xf x/base.tar
$ rm x/base.tar
$ tar -C x/tmpdir -cf x/base.tar .
$ pg_verifybackup -n x
<lots of errors>

It appears that the reason why this fails is that the paths in the
original base.tar from the server do not include "./" at the
beginning, and the ones that I get when I create my own tarfile have
that. But that shouldn't matter. Anyway, I was able to work around it
like this:

$ tar -C x/tmpdir -cf x/base.tar `(cd x/tmpdir; echo *)`

Then the result verifies. But I feel like we should have some test
cases that do this kind of stuff so that there is automated
verification. In fact, the current patch seems to have no negative
test cases at all. I think we should test all the cases in
003_corruption.pl with tar format backups as well as with plain format
backups, which we could do by untarring one of the archives, messing
something up, and then retarring it. I also think we should have some
negative test case specific to tar-format backup. I suggest running a
coverage analysis and trying to craft test cases that hit as much of
the code as possible. There will probably be some errors you can't
hit, but you should try to hit the ones you can.

> 2. For the verify_checksum() function, I kept the argument name as
> bytes_read. Should we rename it to something more meaningful like
> computed_bytes, computed_size, or checksum_computed_size?

I think it's fine the way you have it.

> 0011 patch: Regarding function names:
>
> 1. named the function verify_tar_backup_file() to align with
> verify_plain_backup_file(), but it does not perform the complete
> verification as verify_plain_backup_file does. Not sure if it is the
> right name.

I was thinking of something like precheck_tar_backup_file().

> 2. verify_tar_file_contents() is the second and final part of tar
> backup verification. Should its name be aligned with
> verify_tar_backup_file()? I’m unsure what the best name would be.
> Perhaps verify_tar_backup_file_final(), but then
> verify_tar_backup_file() would need to be renamed to something like
> verify_tar_backup_file_initial(), which might be too lengthy.

verify_tar_file_contents() actually verifies the contents of all the
tar files, not just one, so the name is a bit confusing. Maybe
verify_all_tar_files().

> 3. verify_tar_contents() is the core of verify_tar_file_contents()
> that handles the actual verification. I’m unsure about the current
> naming. Should we rename it to something like
> verify_tar_contents_core()? It wouldn’t be an issue if we renamed
> verify_tar_file_contents() as pointed in point #2.

verify_one_tar_file()?

But with those renames, I think you really start to see why I'm not
very comfortable with verify_backup_directory(). The tar and plain
format cases aren't really doing the same thing. We're just gluing
them into a single function anyway.

I am also still uncomfortable with the way you've refactored some of
this so that we end up with very small amounts of code far away from
other code that they influence. Like you end up with this:

        /* Check the backup manifest entry for this file. */
        m = verify_manifest_entry(context, relpath, sb.st_size);

        /* Validate the pg_control information */
        if (should_verify_control_data(context->manifest, m))
...
        if (show_progress && !context->skip_checksums &&
                should_verify_checksum(m))

But verify_manifest_entry can return NULL or it can set m->bad and
either of those change the result of should_verify_control_data() and
should_verify_checksum(), but none of that is obvious when you just
look at this. Admittedly, the code in master isn't brilliant in terms
of making it obvious what's happening either, but I think this is
worse. I'm not really sure what I think we should do about that yet,
but I'm uncomfortable with it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_verifybackup: TAR format backup verification

Reply via email to