On 04.10.23 22:08, Robert Haas wrote:
- I would like some feedback on the generation of WAL summary files.
Right now, I have it enabled by default, and summaries are kept for a
week. That means that, with no additional setup, you can take an
incremental backup as long as the reference backup was taken in the
last week. File removal is governed by mtimes, so if you change the
mtimes of your summary files or whack your system clock around, weird
things might happen. But obviously this might be inconvenient. Some
people might not want WAL summary files to be generated at all because
they don't care about incremental backup, and other people might want
them retained for longer, and still other people might want them to be
not removed automatically or removed automatically based on some
criteria other than mtime. I don't really know what's best here. I
don't think the default policy that the patches implement is
especially terrible, but it's just something that I made up and I
don't have any real confidence that it's wonderful.

The easiest answer is to have it off by default. Let people figure out what works for them. There are various factors like storage, network, server performance, RTO that will determine what combination of full backup, incremental backup, and WAL replay will satisfy someone's requirements. I suppose tests could be set up to determine this to some degree. But we could also start slow and let people figure it out themselves. When pg_basebackup was added, it was also disabled by default.

If we think that 7d is a good setting, then I would suggest considering something like 10d instead. Otherwise, if you do a weekly incremental backup and you have a time change or a hiccup of some kind one day, you lose your backup sequence.
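
To illustrate the margin argument, here is a rough sketch in Python. It assumes a deliberately simplified model in which summaries older than the retention period are simply gone; the function name and the 3-hour delay are made up for the example and are not taken from the patch.

```python
from datetime import timedelta

def summaries_still_available(backup_interval, delay, retention):
    """Simplified model (an assumption, not the patch's logic): the
    summaries reaching back to the reference backup survive only if
    the time since that backup does not exceed the retention period."""
    return backup_interval + delay <= retention

weekly = timedelta(days=7)
hiccup = timedelta(hours=3)  # hypothetical delay on backup day

# With 7d retention, a weekly backup that runs 3 hours late is out of luck.
late_with_7d = summaries_still_available(weekly, hiccup, timedelta(days=7))
# With 10d retention, the same delay is absorbed.
late_with_10d = summaries_still_available(weekly, hiccup, timedelta(days=10))

print(late_with_7d, late_with_10d)
```

Under this model any retention equal to the backup interval leaves no slack at all, which is the reason for suggesting a few extra days.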

Another possible answer is, like, 400 days? Because why not? What is a reasonable upper limit for this?

- It's regrettable that we don't have incremental JSON parsing; I
think that means anyone who has a backup manifest that is bigger than
1GB can't use this feature. However, that's also a problem for the
existing backup manifest feature, and as far as I can see, we have no
complaints about it. So maybe people just don't have databases with
enough relations for that to be much of a live issue yet. I'm inclined
to treat this as a non-blocker,

It looks like each file entry in the manifest takes about 150 bytes, so 1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now?
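
For anyone who wants to redo the estimate with a different per-entry size, the calculation is just the following. The 150-byte figure is the rough estimate from above, not a measured constant, and the variable names are only for illustration.

```python
# Rough capacity estimate for a backup manifest capped at 1 GB.
MANIFEST_LIMIT_BYTES = 1024**3   # 1 GB limit discussed above
BYTES_PER_ENTRY = 150            # approximate size of one file entry (estimate)

# Integer division: a partial entry does not count as a file.
max_files = MANIFEST_LIMIT_BYTES // BYTES_PER_ENTRY
print(max_files)
```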

- Right now, I have a hard-coded 60 second timeout for WAL
summarization. If you try to take an incremental backup and the WAL
summaries you need don't show up within 60 seconds, the backup times
out. I think that's a reasonable default, but should it be
configurable? If yes, should that be a GUC or, perhaps better, a
pg_basebackup option?

The current user experience of pg_basebackup is that it waits possibly a long time for a checkpoint, and there is an option to make it go faster, but there is no timeout AFAICT. Is this substantially different? Could we just let it wait forever?

Also, does waiting for the checkpoint and waiting for WAL summarization happen in parallel? If so, what if it starts a checkpoint that might take 15 minutes to complete, and then after 60 seconds it kicks you off because the WAL summarization isn't ready? That might be wasteful.

- I'm curious what people think about the pg_walsummary tool that is
included in 0006. I think it's going to be fairly important for
debugging, but it does feel a little bit bad to add a new binary for
something pretty niche.

This seems fine.

Is the WAL summary file format documented anywhere in your patch set yet? My only thought was, maybe the file format could be human-readable (more like backup_label) to avoid this. But maybe not.


