Hello, I found that pg_basebackup from a replication standby fails after the following steps, on 9.3 and the master.
- start a replication master
- start a replication standby
- stop the master in a mode other than immediate

pg_basebackup to the standby will then fail with the following error.
pg_basebackup: could not get transaction log end position from server: ERROR: could not find any WAL files
The immediate cause is that do_pg_stop_backup returns an earlier LSN than do_pg_start_backup. The backup start point is the redo point of the last executed restart point, and the backup end point is the minRecoveryPoint at the time of the call. A restart point doesn't update minRecoveryPoint when it is actually executed, even though ControlFile->checkPointCopy is updated to the redo point of the restart point just made. In this situation minRecoveryPoint is too small to serve as the backup end point. That is, the end point can fall behind the start point.

This can be caused by the simple steps above, but it can also occur when pg_basebackup connects after the master's disconnection during a restart point (given some other timing-dependent conditions).

So the following comment in do_pg_stop_backup seems somewhat wrong.
 * We return the current minimum recovery point as the backup end
 * location. Note that it can be greater than the exact backup end
 * location if the minimum recovery point is updated after the backup of
 * pg_control. This is harmless for current uses.
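To make the start/end mismatch described above concrete, here is a toy model in C. It is only a sketch and not PostgreSQL code: the type Lsn, the struct StandbyState and all function names are hypothetical stand-ins, and the only behavior it assumes is what is described above (a restart point moves the redo point but leaves minRecoveryPoint untouched).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* hypothetical stand-in for XLogRecPtr */

/* Hypothetical, heavily simplified view of the standby's control data. */
typedef struct StandbyState
{
	Lsn		checkpoint_redo;	/* models ControlFile->checkPointCopy.redo */
	Lsn		min_recovery_point; /* models ControlFile->minRecoveryPoint */
} StandbyState;

/*
 * Model of a restart point: the redo point moves forward, but
 * min_recovery_point is deliberately left unchanged.
 */
static void
perform_restartpoint(StandbyState *s, Lsn redo)
{
	assert(redo >= s->checkpoint_redo); /* redo points never move backwards */
	s->checkpoint_redo = redo;
}

/* Backup start point: redo point of the last executed restart point. */
static Lsn
backup_start(const StandbyState *s)
{
	return s->checkpoint_redo;
}

/* Backup end point as currently returned: minRecoveryPoint at call time. */
static Lsn
backup_end_current(const StandbyState *s)
{
	return s->min_recovery_point;
}

/* A backup is only usable if its end point is not behind its start point. */
static bool
backup_range_is_valid(Lsn start, Lsn end)
{
	return start <= end;
}
```

In this model, a state with checkpoint_redo = 0x1000 and min_recovery_point = 0x2000 gives a valid range, but after perform_restartpoint(&s, 0x3000) the end point (0x2000) is behind the start point (0x3000), which corresponds to the situation in which no WAL files can be found.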
After looking more closely, I found that minRecoveryPoint tends to be too small as the backup end point, and that records up to lastReplayedEndRecPtr can affect pages on disk, which can then go into the backup just taken. My conclusion is that do_pg_stop_backup should return lastReplayedEndRecPtr, not minRecoveryPoint. The attached small patch does this on the master. The first problem is fixed by this for me.

Any thoughts?

# Sorry, but I'll be offline 'til Monday.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
From a37a7bfb05d9066676647ade0ecff85ee5cda2c4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp>
Date: Thu, 9 Jun 2016 14:56:07 +0900
Subject: [PATCH] Make pg_stop_backup on standby give proper end LSN.

pg_basebackup to a replication standby can fail with an error that
says no WAL is available. This is because do_pg_stop_backup returns
the minRecoveryPoint as the backup end point, but it is sometimes too
small, that is, behind the backup start point given by
do_pg_start_backup. We use lastReplayedEndRecPtr instead. It is the
last LSN that may affect page files on disk, which is suitable for
this use.
---
 src/backend/access/transam/xlog.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b473f19..438b091 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10409,9 +10409,9 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 		 * required files are available, a user should wait for them to be
 		 * archived, or include them into the backup.
 		 *
-		 * We return the current minimum recovery point as the backup end
+		 * We return the current last replayed point as the backup end
 		 * location. Note that it can be greater than the exact backup end
-		 * location if the minimum recovery point is updated after the backup of
+		 * location if the last replayed point is updated after the backup of
 		 * pg_control. This is harmless for current uses.
 		 *
 		 * XXX currently a backup history file is for informational and debug
@@ -10430,6 +10430,8 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 		 */
 		SpinLockAcquire(&XLogCtl->info_lck);
 		recptr = XLogCtl->lastFpwDisableRecPtr;
+		stoppoint = XLogCtl->lastReplayedEndRecPtr;
+		stoptli = XLogCtl->lastReplayedTLI;
 		SpinLockRelease(&XLogCtl->info_lck);
 
 		if (startpoint <= recptr)
@@ -10442,12 +10444,6 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 						 "Enable full_page_writes and run CHECKPOINT on the master, "
 						 "and then try an online backup again.")));
 
-
-		LWLockAcquire(ControlFileLock, LW_SHARED);
-		stoppoint = ControlFile->minRecoveryPoint;
-		stoptli = ControlFile->minRecoveryPointTLI;
-		LWLockRelease(ControlFileLock);
-
 		if (stoptli_p)
 			*stoptli_p = stoptli;
 		return stoppoint;
-- 
1.8.3.1