Greetings,

* David Steele (da...@pgmasters.net) wrote:
> On 11/21/23 12:41, Andres Freund wrote:
> > Sure. They also receive a backup_label today. If an external solution forgets
> > to replace pg_control copied as part of the filesystem copy, they won't get an
> > error after the removal of backup_label, just like they don't get one today if
> > they don't put backup_label in the data directory.  Given that users don't do
> > the right thing with backup_label today, why can we rely on them doing the
> > right thing with pg_control?
> 
> I think reliable backup software does the right thing with backup_label, but
> if the user starts getting errors on recovery they then decide to remove
> backup_label. I know we can't do much about bad backup software, but we can
> at least make this a bit more resistant to user error after the fact.
> 
> It doesn't help that one of our hints suggests removing backup_label. In
> highly automated systems, the user might not even know they just restored
> from a backup. They are only in the loop because the restore failed and they
> are trying to figure out what is going wrong. When they remove backup_label
> the cluster comes up just fine. Victory!

Yup, this is exactly the issue.

> This is the scenario I've seen most often -- not the backup/restore process
> getting it wrong but the user removing backup_label on their own initiative.
> And because it yields such a positive result, at least initially, they
> remember in the future that the thing to do is to remove backup_label
> whenever they see the error.
> 
> If they only have pg_control, then their only choice is to get it right or
> run pg_resetwal. Most users have no knowledge of pg_resetwal so it will take
> them longer to get there. Also, I think that tool makes it pretty clear that
> corruption will result and the only thing to do is a logical dump and
> restore after using it.

Agreed.

> There are plenty of ways a user can mess things up. What I'd like to prevent
> is the appearance of everything being OK when in fact they have corrupted
> their cluster. That's the situation we have now with backup_label. Is this
> new solution perfect? No, but I do think it checks several boxes, and is a
> worthwhile improvement.

+1.

As for the complaint about 'operators' having issues with the changes
we've been making in this area: where are those people complaining,
exactly?  Who are they?  I feel like we keep getting this kind of
push-back in this area from folks on this list but not from actual
backup software authors; all the complaints seem to either be 
speculative or unattributed pass-through from someone else.

What would really be helpful would be hearing from these individuals
directly as to what the issues are with the changes, such that perhaps
we can do things better in the future to avoid whatever the issue is
they're having with the changes.  Simply saying we shouldn't make
changes in this area isn't workable and the constant push-back is
actively discouraging to folks trying to make improvements.  Obviously
it's a biased view, but we've not had issues making the necessary
adjustments in pgbackrest with each release and I feel like if the
authors of wal-g or barman had such issues they would have spoken up.

Making a change as suggested which only helps pg_basebackup (and tools
like pgbackrest, since it's in C and can also make this particular
change) ends up leaving tools like wal-g and barman potentially still
with an easy way for users of those tools to corrupt their databases-
even though we've not heard anything from the authors of those tools
about issues with the proposed change, nor have there been a lot of
complaints from them about the prior changes to indicate that they'd
even have an issue with the more involved change.  Given the lack of
complaint about past changes, I'd certainly rather err on the side of
improved safety for users than on the side of the authors of these tools
possibly complaining.

What these changes have done is finally break things like omnipitr
completely, which hasn't been maintained in a very long time.  The
changes in v12 broke recovery with omnipitr but not backup, and folks
were trying to use omnipitr as recently as with v13[1].  Certainly
having a backup tool that only works for backup (for some value of 'works', anyway, as
it still used exclusive backup mode meaning that a crash during a backup
would cause the system to not come back up after...) but doesn't work
for recovery isn't exactly great and I'm glad that, now, an attempt to
use omnipitr to perform a backup will fail.  As with lots of other areas
of PG, folks need to read the release notes and potentially update their
code for new major versions.  If anything, the backup area is less of an
issue for this because the authors of the backup tools are able to make
the change (and are often the ones pushing for these changes) and
the end-user isn't impacted at all.

Much the same can be said for wal-e, with users still trying to use it
even long after it was stated to be obsolete (the Obsolescence Notice[2]
was added in February 2022, though it hadn't been maintained for a while
before that, and an issue was opened in December 2022 asking for it to
be updated to v15[3]...).

Thanks,

Stephen

[1]: https://github.com/omniti-labs/omnipitr/issues/43
[2]: https://github.com/wal-e/wal-e/commit/f5b3e790fe10daa098b8cbf01d836c4885dc13c7
[3]: https://github.com/wal-e/wal-e/issues/433
