David, * David Steele (da...@pgmasters.net) wrote: > On 7/23/17 11:48 PM, Masahiko Sawada wrote: > >On Sat, Jul 22, 2017 at 8:04 AM, Stephen Frost <sfr...@snowman.net> wrote: > >> > >>I started discussing this with David off-list and he'd like a chance to > >>review it in a bit more detail (he's just returning from being gone for > >>a few weeks). That review will be posted to this thread on Monday, and > >>I'll wait until then to move forward with the patch. > > Before reviewing the patch, I would note that it looks like this > issue was introduced in b6a323a8c before the 9.6 release. The > documentation states that the pg_stop_backup() function will wait > for all required WAL to be archived, which corresponds to the > default in the new patch of waitforarchive = true. The commit notes > that the documentation needs updating, but since that didn't happen > I think we should consider this a bug in 9.6 and back patch.
I tend to agree with this. The documentation clearly says that pg_stop_backup() waits for all WAL to be archived, but, today, if it's run on a standby then it doesn't wait, which might lead to invalid backups if the backup software is depending on that. > While this patch brings pg_stop_backup() in line with the > documentation, it also introduces a behavioral change compared to > 9.6. Currently, the default 9.6 behavior on a standby is to return > immediately from pg_stop_backup(), but this patch will cause it to > block by default instead. Since action on the master may be > required to unblock the process, I see this as a pretty significant > change. I still think we should fix and backpatch, but I wanted to > point this out. This will need to be highlighted in the release notes for 9.6.4 also, assuming there is agreement to back-patch this, and we'll need to update the documentation in 9.6 also to be clearer about what happens on a standby. > The patch looks sensible to me. A few comments: > > 1) I would change: > > "Check if the WAL segment needed for this backup have been > completed, in which case a forced segment switch may be needed on > the primary." > > To something like: > > "If the WAL segments required for this backup have not been archived > then it might be necessary to force a segment switch on the > primary." I'm a bit on the fence regarding the distinction here for the backup-from-standby and this errhint(). The issue is that the WAL for the backup hasn't been archived yet and that may be because of either: archive_command is failing OR No WAL is getting generated In either case, both of these apply to the primary and the standby. As such, I'm inclined to try to mention both in the original errhint() instead of making two different ones. I'm open to suggestions on this, of course, but my thinking would be more like: ----- Either archive_command is failing or not enough WAL has been generated to require a segment switch. Run pg_switch_wal() to request a WAL switch and monitor your logs to check that your archive_command is executing properly. ----- And then change pg_switch_wal()'s errhint for when it's run on a replica to be: ---- HINT: WAL control functions cannont be executed during recovery; they should be executed on the primary instead. ---- (or similar, again, open to suggestions here). > 2) At backup.sgml L1015 it says that pg_stop_backup() will > automatically switch the WAL segment. There should be a caveat here > for standby backups, like: > > When called on a master it terminates the backup mode and performs > an automatic switch to the next WAL segment. The reason for the > switch is to arrange for the last WAL segment written during the > backup interval to be ready to archive. When called on a standby > the WAL segment switch must be performed manually on the master if > it does not happen due to normal write activity. s/master/primary/g Otherwise it looks alright.. Might try to reword to use similar language as to what we have today in 184.108.40.206. > 3) The fact that this fix requires "archive_mode = always" seems > like a pretty big caveat, thought I don't have any ideas on how to > make it better without big code changes. Maybe it would be help to > change: > > + the backup is taken on a standby, <function>pg_stop_backup</> waits > + for WAL to be archived when <varname>archive_mode</> > > To: > > + the backup is taken on a standby, <function>pg_stop_backup</> waits > + for WAL to be archived *only* when <varname>archive_mode</> I'm thinking of rewording that a bit to say "When archive_mode = always" instead, as I think that might be a little clearer. > Perhaps this should be noted in the pg_basebackup docs as well? That seems like it's probably a good idea too, as people might wonder why pg_basebackup hasn't finished yet if it's waiting for WAL to be archived. Thanks! Stephen
Description: Digital signature