Hello, I found that pg_basebackup from a replication standby
fails after the following steps, on 9.3 and on the master branch.

- start a replication master
- start a replication standby
- stop the master in any mode other than immediate.

pg_basebackup against the standby then fails with the following error.

pg_basebackup: could not get transaction log end position from server: ERROR: could not find any WAL files

The immediate cause is that do_pg_stop_backup returns an earlier
LSN than do_pg_start_backup. The backup start point is the redo
point of the last executed restart point, and the backup end
point is the minRecoveryPoint at the time of the call.

A restart point doesn't update the minRecoveryPoint when it is
actually executed, even though ControlFile->checkPointCopy is
updated to the redo point of the restart point just made. In this
situation the minRecoveryPoint is too small to serve as the backup
end point. That is, the end point can fall behind the start point.
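
To make the ordering problem concrete, here is a minimal sketch in C
of a toy model, not PostgreSQL code. The type Lsn, the struct
StandbyModel and the model_* functions are hypothetical names made up
for illustration. They only mirror the two control-file values
described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an LSN; an assumption, not the real XLogRecPtr. */
typedef uint64_t Lsn;

/* Hypothetical model of the two values discussed above. */
typedef struct StandbyModel
{
	Lsn		checkpoint_redo;	/* models the redo point in checkPointCopy */
	Lsn		min_recovery_point; /* models ControlFile->minRecoveryPoint */
} StandbyModel;

/* A restart point advances the redo point but leaves minRecoveryPoint alone. */
static void
model_restartpoint(StandbyModel *st, Lsn new_redo)
{
	assert(st != NULL);
	st->checkpoint_redo = new_redo;
}

/* Backup start: redo point of the last executed restart point. */
static Lsn
model_backup_start(const StandbyModel *st)
{
	assert(st != NULL);
	return st->checkpoint_redo;
}

/* Backup end as computed before the patch: minRecoveryPoint at call time. */
static Lsn
model_backup_end(const StandbyModel *st)
{
	assert(st != NULL);
	return st->min_recovery_point;
}

/* True when the backup end point falls behind the backup start point. */
static bool
model_end_behind_start(const StandbyModel *st)
{
	return model_backup_end(st) < model_backup_start(st);
}
```

In this model, a standby whose redo point is 100 and whose
minRecoveryPoint is 150 is fine. Once a restart point moves the redo
point to 200 while minRecoveryPoint stays at 150,
model_end_behind_start() becomes true, which is the state in which
pg_basebackup reports the error above.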

This can be caused by the simple steps above, but it can also
occur when pg_basebackup connects after the master has
disconnected during a restart point (given some other
timing-dependent conditions).

So the following comment in do_pg_stop_backup seems somewhat
wrong.

* We return the current minimum recovery point as the backup end
* location. Note that it can be greater than the exact backup end
* location if the minimum recovery point is updated after the backup of
* pg_control. This is harmless for current uses.

After looking more closely, I found that the minRecoveryPoint
tends to be too small as the backup end point: records up to
lastReplayedEndRecPtr can affect the pages on disk, and those
pages can go into the backup just taken.

My conclusion here is that do_pg_stop_backup should return
lastReplayedEndRecPtr, not minRecoveryPoint.
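
As a rough illustration of the difference between the two choices,
here is a small C sketch of a toy model, not the real code. Lsn,
ReplayModel, stop_point_current and stop_point_proposed are
hypothetical names. The model assumes that replay has already passed
the backup start point, since a restart point is made from a
checkpoint record that has already been replayed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an LSN; an assumption, not the real XLogRecPtr. */
typedef uint64_t Lsn;

/* Hypothetical snapshot of the values involved when the backup stops. */
typedef struct ReplayModel
{
	Lsn		backup_start;		/* redo point of the last restart point */
	Lsn		min_recovery_point; /* models ControlFile->minRecoveryPoint */
	Lsn		last_replayed_end;	/* models XLogCtl->lastReplayedEndRecPtr */
} ReplayModel;

/* End point as computed before the patch. */
static Lsn
stop_point_current(const ReplayModel *st)
{
	assert(st != NULL);
	return st->min_recovery_point;
}

/* End point as proposed: the end of the last replayed record. */
static Lsn
stop_point_proposed(const ReplayModel *st)
{
	assert(st != NULL);
	return st->last_replayed_end;
}
```

With backup_start = 200, min_recovery_point = 150 and
last_replayed_end = 260, stop_point_current() gives 150, which is
behind the start point, while stop_point_proposed() gives 260, which
is not.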

The attached small patch does this on the master branch. It fixes
the first problem for me.

Any thoughts?

# Sorry, but I'll be offline 'til Monday.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
From a37a7bfb05d9066676647ade0ecff85ee5cda2c4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp>
Date: Thu, 9 Jun 2016 14:56:07 +0900
Subject: [PATCH] Make pg_stop_backup on standby give proper end LSN.

pg_basebackup against a replication standby can fail with an error
saying that no WAL files could be found. This happens because
do_pg_stop_backup returns the minRecoveryPoint as the backup end point,
but that is sometimes too small, that is, it falls behind the backup
start point given by do_pg_start_backup.

We use lastReplayedEndRecPtr instead. It is the last LSN that may
affect page files on disk, which is suitable for this use.
---
 src/backend/access/transam/xlog.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b473f19..438b091 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10409,9 +10409,9 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 	 * required files are available, a user should wait for them to be
 	 * archived, or include them into the backup.
 	 *
-	 * We return the current minimum recovery point as the backup end
+	 * We return the current last replayed point as the backup end
 	 * location. Note that it can be greater than the exact backup end
-	 * location if the minimum recovery point is updated after the backup of
+	 * location if the last replayed point is updated after the backup of
 	 * pg_control. This is harmless for current uses.
 	 *
 	 * XXX currently a backup history file is for informational and debug
@@ -10430,6 +10430,8 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 		 */
 		SpinLockAcquire(&XLogCtl->info_lck);
 		recptr = XLogCtl->lastFpwDisableRecPtr;
+		stoppoint = XLogCtl->lastReplayedEndRecPtr;
+		stoptli = XLogCtl->lastReplayedTLI;
 		SpinLockRelease(&XLogCtl->info_lck);
 
 		if (startpoint <= recptr)
@@ -10442,12 +10444,6 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 				 "Enable full_page_writes and run CHECKPOINT on the master, "
 					 "and then try an online backup again.")));
 
-
-		LWLockAcquire(ControlFileLock, LW_SHARED);
-		stoppoint = ControlFile->minRecoveryPoint;
-		stoptli = ControlFile->minRecoveryPointTLI;
-		LWLockRelease(ControlFileLock);
-
 		if (stoptli_p)
 			*stoptli_p = stoptli;
 		return stoppoint;
-- 
1.8.3.1
