On 11/21/23 12:41, Andres Freund wrote:
On 2023-11-21 07:42:42 -0400, David Steele wrote:
On 11/20/23 19:58, Andres Freund wrote:
On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:
On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:
Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup
That would mean that one still needs to take an extra step to update a
control file with this byte set, which is something you had a concern
with in terms of compatibility when it comes to external backup
solutions because more steps are necessary to take a backup, no?
I was thinking we'd just set it in the pg_basebackup style path, and we'd
error out if it's set and backup_label is present. But we'd still use
backup_label without the pg_control flag set.
So it'd just provide a cross-check that backup_label was not removed for
pg_basebackup style backup, but wouldn't do anything for external backups. But
imo the proposal to just us pg_control doesn't actually do anything for
external backups either - which is why I think my proposal would achieve as
much, for a much lower price.
I'm not sure why you think the patch under discussion doesn't do anything
for external backups. It provides the same benefits to both pg_basebackup
and external backups, i.e. they both receive the updated version of
pg_control.
Sure. They also receive a backup_label today. If an external solution forgets
to replace pg_control copied as part of the filesystem copy, they won't get an
error after the remove of backup_label, just like they don't get one today if
they don't put backup_label in the data directory. Given that users don't do
the right thing with backup_label today, why can we rely on them doing the
right thing with pg_control?
I think reliable backup software does the right thing with backup_label,
but if the user starts getting errors on recovery they the decide to
remove backup_label. I know we can't do much about bad backup software,
but we can at least make this a bit more resistant to user error after
the fact.
It doesn't help that one of our hints suggests removing backup_label. In
highly automated systems, the user might not even know they just
restored from a backup. They are only in the loop because the restore
failed and they are trying to figure out what is going wrong. When they
remove backup_label the cluster comes up just fine. Victory!
This is the scenario I've seen most often -- not the backup/restore
process getting it wrong but the user removing backup_label on their own
initiative. And because it yields such a positive result, at least
initially, they remember in the future that the thing to do is to remove
backup_label whenever they see the error.
If they only have pg_control, then their only choice is to get it right
or run pg_resetwal. Most users have no knowledge of pg_resetwal so it
will take them longer to get there. Also, I think that tool make it
pretty clear that corruption will result and the only thing to do is a
logical dump and restore after using it.
There are plenty of ways a user can mess things up. What I'd like to
prevent is the appearance of everything being OK when in fact they have
corrupted their cluster. That's the situation we have now with
backup_label. Is this new solution perfect? No, but I do think it checks
several boxes, and is a worthwhile improvement.
Regards,
-David
Regards,
-David