On Sat, Aug 5, 2017 at 4:28 PM, Paul A Jungwirth <p...@illuminatedcomputing.com> wrote: > I don't have an opinion on the urgency of back-porting a fix, but if > pg_stop_backup(boolean) allows for inconsistent backups, it does sound > like a problem on 9.6 too.
It doesn't. The talk about inconsistent backups is, I think, not a very precise way of speaking. All backups taken by copying the data directory are inconsistent until you run recovery for long enough to make them consistent. What we're talking about is whether it's guaranteed that all WAL segments needed to reach consistency will be archived by the time pg_stop_backup() returns. However, even if they're not, it doesn't mean that there's anything wrong with your backup per se; it just means you need to replay WAL files that were not in the archive at the time pg_stop_backup() completed when restoring it. So, for example, if you keep an archive of all of your WAL files for PITR purposes, you're fine. IIUC, to have a problem, you have to do the following: 1. Take a backup from the standby, not the master. 2. Take a backup using pg_start_backup() and pg_stop_backup(), not pg_basebackup. 3. Instead of keeping all of your WAL files, just keep the absolute minimum number required to restore the backup. 4. Instead of keeping the WAL files through the LSN returned by pg_stop_backup(), only keep the ones that were archived before pg_stop_backup() returned. All of (1)-(3) are legitimate user choices, although not everyone will make them. (4) is unfortunately the procedure recommended by our documentation, which is where the problem comes in. I think it's pretty lame that the documentation recommends ignoring the return value of pg_stop_backup(); ISTM that it would be very wise to recommend cross-checking the return value against the WAL files you're keeping for the backup. Even if we thought the waiting logic was working perfectly in every case, having a cross-check on something as important as backup integrity seems like an awfully good idea. For example, *even if* we were to apply the patch Michael Paquier proposed, it doesn't actually do anything except emit a warning when archive_mode isn't set to always, and that warning could easily be missed by an automated backup script, and archive_mode=always is probably not a common setting. So you still have the same problem in most cases. I think the root of this problem is that commit 7117685461af50f50c03f43e6a622284c8d54694 did only a pretty perfunctory update of the documentation, as the commit message itself admits: Only reference documentation is included. The main section on backup still needs to be rewritten to cover this, but since that is already scheduled for a separate large rewrite, it's not included in this patch. But it doesn't look like that ever really got done. https://www.postgresql.org/docs/10/static/continuous-archiving.html#backup-lowlevel-base-backup really doesn't talk at all about standbys or differences in required procedures between masters and standbys. It makes statements that are flagrantly riduculous in the case of a standby, like claiming that pg_start_backup() always performs a checkpoint and that pg_stop_backup "terminates the backup mode and performs an automatic switch to the next WAL segment." Well, obviously not. And at least to me, that's the real bug here. Somebody needs to go through and fix this documentation properly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers