On 11/21/23 12:41, Andres Freund wrote:

On 2023-11-21 07:42:42 -0400, David Steele wrote:
On 11/20/23 19:58, Andres Freund wrote:
On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:
On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:
Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup

That would mean that one still needs to take an extra step to update a
control file with this byte set, which is something you had a concern
with in terms of compatibility when it comes to external backup
solutions because more steps are necessary to take a backup, no?

I was thinking we'd just set it in the pg_basebackup style path, and we'd
error out if it's set and backup_label is present. But we'd still use
backup_label without the pg_control flag set.

So it'd just provide a cross-check that backup_label was not removed for
pg_basebackup style backup, but wouldn't do anything for external backups. But
imo the proposal to just us pg_control doesn't actually do anything for
external backups either - which is why I think my proposal would achieve as
much, for a much lower price.

I'm not sure why you think the patch under discussion doesn't do anything
for external backups. It provides the same benefits to both pg_basebackup
and external backups, i.e. they both receive the updated version of
pg_control.

Sure. They also receive a backup_label today. If an external solution forgets
to replace pg_control copied as part of the filesystem copy, they won't get an
error after the remove of backup_label, just like they don't get one today if
they don't put backup_label in the data directory.  Given that users don't do
the right thing with backup_label today, why can we rely on them doing the
right thing with pg_control?

I think reliable backup software does the right thing with backup_label, but if the user starts getting errors on recovery they the decide to remove backup_label. I know we can't do much about bad backup software, but we can at least make this a bit more resistant to user error after the fact.

It doesn't help that one of our hints suggests removing backup_label. In highly automated systems, the user might not even know they just restored from a backup. They are only in the loop because the restore failed and they are trying to figure out what is going wrong. When they remove backup_label the cluster comes up just fine. Victory!

This is the scenario I've seen most often -- not the backup/restore process getting it wrong but the user removing backup_label on their own initiative. And because it yields such a positive result, at least initially, they remember in the future that the thing to do is to remove backup_label whenever they see the error.

If they only have pg_control, then their only choice is to get it right or run pg_resetwal. Most users have no knowledge of pg_resetwal so it will take them longer to get there. Also, I think that tool make it pretty clear that corruption will result and the only thing to do is a logical dump and restore after using it.

There are plenty of ways a user can mess things up. What I'd like to prevent is the appearance of everything being OK when in fact they have corrupted their cluster. That's the situation we have now with backup_label. Is this new solution perfect? No, but I do think it checks several boxes, and is a worthwhile improvement.

Regards,
-David

Regards,
-David


Reply via email to