Greetings, * Andres Freund (and...@anarazel.de) wrote: > I recently mentioned to Robert (and also Heikki earlier), that I think I see a > way to detect an omitted backup_label in a relevant subset of the cases (it'd > apply to the pg_control as well, if we moved to that). Robert encouraged me > to share the idea, even though it does not provide complete protection.
That would certainly be nice. > The subset I think we can address is the following: > > a) An omitted backup_label would lead to corruption, i.e. without the > backup_label we won't start recovery at the right position. Obviously it'd > be better to also catch a wrong procedure when it'd not cause corruption - > perhaps my idea can be extended to handle that, with a small bit of > overhead. > > b) The backup has been taken from a primary. Unfortunately that probably can't > be addressed - but the vast majority of backups are taken from a primary, > so I think it's still a worthwhile protection. Agreed that this is a worthwhile set to try and address, even if we can't address other cases. > Here's my approach > > 1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a > primary, emitted just *after* the checkpoint completed > > 2) When replaying a base backup start record, we create a state file that > includes the corresponding LSN in the filename > > 3) On the primary, the state file for XLOG_BACKUP_START is *not* created at > that time. Instead the state file is created during pg_backup_stop(). > > 4) When replaying a XLOG_BACKUP_END record, we verif that the state file > created by XLOG_BACKUP_START is present, and error out if not. Backups > that started before the redo LSN from backup_label are ignored > (necessitates remembering that LSN, but we've been discussing that anyway). > > > Because the backup state file on the primary is only created during > pg_backup_stop(), a copy of the data directory taken between pg_backup_start() > and pg_backup_stop() does *not* contain the corresponding "backup state > file". Because of this, an omitted backup_label is detected if recovery does > not start early enough - recovery won't encounter the XLOG_BACKUP_START record > and thus would not create the state file, leading to an error in 4). While I see the idea here, I think, doesn't it end up being an issue if things happen like this: pg_backup_start -> XLOG_BACKUP_START WAL written -> new checkpoint happens -> pg_backup_stop -> XLOG_BACKUP_STOP WAL written -> crash In that scenario, we'd go back to the new checkpoint (the one *after* the checkpoint that happened before we wrote XLOG_BACKUP_START), start replaying, and then hit the XLOG_BACKUP_STOP and then error out, right? Even though we're actually doing crash recovery and everything should be fine as long as we replay all of the WAL. Perhaps we can make the pg_backup_stop and(/or?) the writing out of XLOG_BACKUP_STOP wait until just before the next checkpoint and hopefully minimize that window ... but I'm not sure if we could make that window zero and what happens if someone does end up hitting it? Doesn't seem like there's any way around it, which seems like it might be a problem. I suppose it wouldn't be hard to add some option to tell PG to ignore the XLOG_BACKUP_STOP ... but then that's akin to removing backup_label which lands us possibly back into the issue of people mis-using that. > It is not a problem that the primary does not create the state file before the > pg_backup_stop() - if the primary crashes before pg_backup_stop(), there is no > XLOG_BACKUP_END and thus no error will be raised. It's a bit odd that the > sequence differs between normal processing and recovery, but I think that's > nothing a good comment couldn't explain. Right, crashing before pg_backup_stop() is fine, but crashing *after* would be an issue, I think, as outlined above, until the next checkpoint completes, so we've moved the window but not eliminated it. > I haven't worked out the details, but I think we might be able extend this to > catch errors even if there is no checkpoint during the base backup, by > emitting the WAL record *before* the RequestCheckpoint(), and creating the > corresponding state file during backup_label processing at the start of > recovery. That'd probably make the logic for when we can remove the backup > state files a bit more complicated, but I think we could deal with that. Not entirely following this- are you meaning that we might be able to make something here work in the case where we don't have pg_backup_start() wait for a checkpoint to happen (which I have some serious doubts about?), or are you saying that the above doesn't work unless there's at least one post-pg_backup_start() checkpoint? I don't immediately see why that would be the case though. Also, if we wrote out the XLOG_BACKUP_START before the checkpoint that we start replay from and instead move that logic to backup_label processing ... doesn't that end up not working in the same case as we have today- where someone decides to remove backup_label? Going to stop guessing here as I'm clearly not understanding something about this part. Maybe this is the part that's addressing the concern raised above though and if so, sorry, but would appreciate some additional explanation. Thanks! Stephen
signature.asc
Description: PGP signature