Re: The danger of deleting backup_label

David Steele Mon, 16 Oct 2023 08:45:26 -0700

On 10/16/23 10:55, Robert Haas wrote:

On Sat, Oct 14, 2023 at 11:33 AM David Steele <da...@pgmasters.net> wrote:

All of this is fixable in HEAD, but seems incredibly dangerous to back
patch. Even so, I have attached the patch in case somebody sees an
opportunity that I do not.


I really do not think we should be even thinking about back-patching
something like this. It's clearly not a bug fix, although I'm sure
that someone can try to characterize it that way, if they want to make
the well-worn argument that any behavior they don't like is a bug. But
that's a pretty lame argument. Usage errors on the part of users are
not bugs, even if we've coded the software in such a way as to make
those errors more likely.

Hmmm, the reason to back patch this is that it would fix [1], which surelooks like a problem to me even if it is not a "bug". We can certainlyrequire backup software to retry pg_control until the checksum is validbut that seems like a pretty big ask, even considering how complicatedbackup is.

I investigated this as a solution to [1] because it seemed like a bettersolution that what we have in [1]. But once I got into the weeds it wasobvious that this wasn't going to be something we could back patch.

I think what we ought to be talking about is whether a change like
this is a good idea even in master. I don't think it's a terrible
idea, but I'm also not sure that it's a good idea. The problem is that
if you're doing the right thing with your backup_label, then this is
unnecessary, and if you're doing the wrong thing, then why should you

do the right thing about this?

First and foremost, this solves the problem of pg_control being tornwhen read by the backup software. It can't be torn if it is not there.

There are also other advantages -- we can massage pg_control beforewriting it back out. This already happens, but before that happens thereis a copy of the (prior) running pg_control there that can mess up theprocess.

I mean, admittedly you can't just
ignore a fatal error, but I think people will just run pg_resetwal,
which is even worse than starting from the wrong checkpoint.

If you start from the last checkpoint (which is what will generally bestored in pg_control) then the effect is pretty similar.

I feel
like in cases where a customer I'm working with has a bad backup,
their entire focus is on doing something to that backup to get a
running system back, whatever it takes. It's already too late at that
point to fix the backup procedure - they only have the backups they
have. You could hope people would do test restores before disaster
strikes, but people who are that prepared are probably running a real
backup tool and will never have this problem in the first place.

Right now the user can remove backup_label and get a "successful"restore and not realize that they have just corrupted their cluster.This is independent of the backup/restore tool doing all the right things.

My goal here is to narrow the options to try and make it so there is*one* valid procedure that will work. For this patch the idea is thatthey *must* start Postgres to get a valid pg_control from thebackup_label. Any other action leads to a fatal error.

Note that the current patch is very WIP and does not actually doeverything I'm talking about here. I was just trying to see if it couldbe used to solve the problem in [1]. It can't.

Perhaps that's all too pessimistic. I don't know. Certainly, other
people can have experiences that are different than mine. But I feel
like I struggle to think of a case where this would have prevented a
bad outcome, and that makes me wonder whether it's really a good idea
to complicate the system.

I'm specifically addressing cases like those that came up (twice!) in[2]. This is the main place I see people stumbling these days. If evenhackers can make this mistake then we should do better.


Regards,
-David

[1]https://www.postgresql.org/message-id/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de[2] [1]https://www.postgresql.org/message-id/flat/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8%3D2aw%40mail.gmail.com#746e492bfcd2667635634f1477a61288

Re: The danger of deleting backup_label

Reply via email to