On 4/19/24 00:50, Robert Haas wrote:
On Wed, Apr 17, 2024 at 7:09 PM David Steele <da...@pgmasters.net> wrote:

Fair enough. I accept that your reasoning is not random, but I'm still
not very satisfied that the user needs to run a separate and rather
expensive process to do the verification when pg_combinebackup already
has the necessary information at hand. My guess is that most users will
elect to skip verification.

I think you're probably right that a lot of people will skip it; I'm
just less convinced than you are that it's a bad thing. It's not a
*great* thing if people skip it, but restore time is actually just
about the worst time to find out that you have a problem with your
backups. I think users would be better served by verifying stored
backups periodically when they *don't* need to restore them.

Agreed, running verify regularly is a good idea, but in my experience most users are only willing to run verify once they suspect (or know) there is an issue. It's a pretty expensive process depending on how many backups you have and where they are stored.

> Also,
saying that we have all of the information that we need to do the
verification is only partially true:

- we do have to parse the manifest anyway, but we don't have to
compute checksums anyway, and I think that cost can be significant
even for CRC-32C and much more significant for any of the SHA variants

- we don't need to read all of the files in all of the backups. if
there's a newer full, the corresponding file in older backups, whether
full or incremental, need not be read

- incremental files other than the most recent only need to be read to
the extent that we need their data; if some of the same blocks have
been changed again, we can economize

How much you save because of these effects is pretty variable. Best
case, you have a 2-backup chain with no manifest checksums, and all
verification will have to do that you wouldn't otherwise need to do is
walk each older directory tree in toto and cross-check which files
exist against the manifest. That's probably cheap enough that nobody
would be too fussed. Worst case, you have a 10-backup (or whatever)
chain with SHA512 checksums and, say, a 50% turnover rate. In that
case, I think having verification happen automatically could be a
pretty major hit, both in terms of I/O and CPU. If your database is
1TB, it's ~5.5TB of read I/O (because one 1TB full backup and 9 0.5TB
incrementals) instead of ~1TB of read I/O, plus the checksumming.

Now, obviously you can still feel that it's totally worth it, or that
someone in that situation shouldn't even be using incremental backups,
and it's a value judgement, so fair enough. But my guess is that the
efforts that this implementation makes to minimize the amount of I/O
required for a restore are going to be important for a lot of people.

Sure -- pg_combinebackup would only need to verify the data that it uses. I'm not suggesting that it should do an exhaustive verify of every single backup in the chain. Though I can see how it sounded that way since with pg_verifybackup that would pretty much be your only choice.

The beauty of doing verification in pg_combinebackup is that it can do a lot less than running pg_verifybackup against every backup but still get a valid result. All we care about is that the output is correct -- if there is corruption in an unused part of an earlier backup pg_combinebackup doesn't need to care about that.

As far as I can see, pg_combinebackup already checks most of the boxes. The only thing I know that it can't do is detect missing files and that doesn't seem like too big a thing to handle.

Regards,
-David


Reply via email to