Greetings, (Adding David Steele into the CC on this one...)
* Thomas Munro (thomas.mu...@gmail.com) wrote: > This is a frustrating thread, because despite the last patch solving > most of the problems we discussed, it doesn't address the > low-level-backup procedure in a nice way. We'd have to tell users > they have to flock that file, or add a new step "pg_controldata --raw > > pg_control", which seems weird when they already have a connection > to the server. flock'ing the file strikes me as dangerous to ask external tools to do due to the chances of breaking the running system if they don't do it right. I'm generally open to the idea of having the backup tool have to do more complicated work to be correct but that just seems likely to cause issues. Also- haven't looked yet, but I'm not sure that's even possible if your backup tool is running as a user who only has read access to the data directory? I don't want us to give up on that feature. > Maybe it just doesn't matter if eg the pg_controldata program can > spuriously fail if pointed at a running system, and I was being too > dogmatic trying to fix even that. Maybe we should just focus on > fixing backups. Even there, I am beginning to suspect we are solving > this problem in the wrong place when a higher level change could > simplify the problem away. For a running system.. perhaps pg_controldata should be connecting to the database and calling functions there? Or just complain that the system is online and tell the user to do that? > Idea for future research: Perhaps pg_backup_stop()'s label-file > output should include the control file image (suitably encoded)? Then > the recovery-from-label code could completely ignore the existing > control file, and overwrite it using that copy. It's already > partially ignoring it, by using the label file's checkpoint LSN > instead of the control file's. Perhaps the captured copy could > include the correct LSN already, simplifying that code, and the low > level backup procedure would not need any additional steps or caveats. > No more atomicity problem for low-level-backups... but probably not > something we would back-patch, for such a rare failure mode. I like this general direction and wonder if we could possibly even push a bit harder on it: have the backup_label include the control file's contents in some form that is understood and then tell tools to *not* copy the running pg_control file ... and maybe even complain if a pg_control file exists when we detect that backup_label has the control file's image. We've certainly had problems in the past where people would just nuke the backup_label file, even though they were restoring from a backup, because they couldn't start the system up since their restore command wasn't set up properly or their WAL archive was missing. Being able to get rid of the control file being in the backup at all would make it harder for someone to get to a corrupted-but-running system and that seems like it's a good thing. > Here's a new minimal patch that solves only the bugs in basebackup + > the simple SQL-facing functions that read the control file, by simply > acquiring ControlFileLock in the obvious places. This should be > simple enough for back-patching? I don't particularly like fixing this in a way that only works for pg_basebackup and means that the users of other backup tools don't have a solution to this problem. What are we supposed to tell users of pgBackRest when they see this fix for pg_basebackup in the next set of minor releases and they ask us if we've addressed this risk? We might be able to accept the 're-read on CRC failure' approach, if it were being used for pg_controldata and we documented that external tools and especially backup tools using the low-level API are required to check the CRC and to re-read on failure if accessing a running system. While it's not ideal, maybe we could get away with changing the contents of the backup_label as part of a back-patch? The documentation, at least on a quick look, says to copy the second field from pg_backup_stop into a backup_label file but doesn't say anything about what those contents are or if they can change. That would at least address the concern of backups ending up with a corrupt pg_control file and not being able to be restored, even if tools aren't updated to verify the CRC or similar. Of course, that's a fair bit of code churn for a backpatch, which I certainly understand as being risky. If we can't back-patch that, it might still be the way to go moving forward, while also telling tools to check the CRC. (I'm not going to try to figure out some back-patchable pretend solution for this for shell scripts that pretend to be able to backup running PG databases; this issue is probably the least of their issues anyway...) A couple of interesting notes on this though- pgBackRest doesn't only read the pg_control file at backup time, we also check it at archive_command time, to make sure that the system ID and version that are in the control file match up with the information in the WAL file we're getting ready to archive and that those match with the system ID and version of the repo/stanza into which we are pushing the WAL file. We do read the control file on the replica but that isn't the one we actually push into the repo as part of a backup- that's always the one we read from the primary (we don't currently support 'backup just from the replica'). Coming out of our discussion regarding this, we're likely to move forward on the check-CRC-and-re-read approach for the next pgBackRest release. If PG provides a better solution for us to use, great, but given that this has been shown to happen, we're not intending to wait around for PG to provide us with a better fix. Thanks, Stephen
signature.asc
Description: PGP signature