Re: [BUGS] Recovery bug

Heikki Linnakangas Mon, 25 Oct 2010 04:45:05 -0700

On 19.10.2010 22:40, Jeff Davis wrote:

On Tue, 2010-10-19 at 09:51 -0700, Jeff Davis wrote:

On Tue, 2010-10-19 at 12:26 +0300, Heikki Linnakangas wrote:

Excluding pg_xlog is just a recommendation at the moment, though, so we
would need a big warning in the docs. And some way to enforce that
just_kidding is not included in the backup would be nice, maybe we could
remove read-permission from it?


Hmm, removing the read bit would add some confidence into the process. I
like that idea better than just assuming that the user won't copy it.

I think I like this direction most, because it doesn't leave us
guessing. If the file is there then we assume normal recovery. If we
encounter recovery.conf we throw FATAL. If we encounter backup_label we
can simply remove it (perhaps with a warning that there was a crash in
the middle of a backup).


On second thought, this doesn't sound backpatch-friendly. We should
probably put a simpler fix in first and back-patch it. Then we can do
something like your proposal for 9.1. What do you think of my proposed
patch?


Yes, let's go ahead with your original patch.

It seems we should use ReadRecord instead of the lower-levelXLogPageRead function. One difference is that ReadRecord performs abunch of sanity checks on the record, while XLogPageRead just reads theraw page. Extra sanity checking before removing backup_label seems likea good idea. Another difference is that in standby-mode, ReadRecord willretry until it succeeds. A standby server should keep retrying, even thevery first record, until it succeeds, otherwise we have a change inbehavior.

So I'm thinking of the attached patch. I'll give this some testing onolder branches and commit.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22fd578..1d1a7bc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5839,6 +5839,18 @@ StartupXLOG(void)
 		record = ReadCheckpointRecord(checkPointLoc, 0);
 		if (record != NULL)
 		{
+			/*
+			 * Make sure that REDO location exists. This may not be
+			 * the case if there was a crash during an online backup,
+			 * which left a backup_label around that references a WAL
+			 * segment that's already been archived.
+			 */
+			memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
+			if (!ReadRecord(&(checkPoint.redo), LOG, false))
+				ereport(FATAL,
+						(errmsg("could not find redo location referenced by checkpoint record"),
+						 errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
+
 			ereport(DEBUG1,
 					(errmsg("checkpoint record is at %X/%X",
 							checkPointLoc.xlogid, checkPointLoc.xrecoff)));
@@ -5846,7 +5858,7 @@ StartupXLOG(void)
 		}
 		else
 		{
-			ereport(PANIC,
+			ereport(FATAL,
 					(errmsg("could not locate required checkpoint record"),
 					 errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
 		}
@@ -5892,10 +5904,10 @@ StartupXLOG(void)
 				ereport(PANIC,
 					 (errmsg("could not locate a valid checkpoint record")));
 		}
+		memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
 	}
 
 	LastRec = RecPtr = checkPointLoc;
-	memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
 	wasShutdown = (record->xl_info == XLOG_CHECKPOINT_SHUTDOWN);
 
 	ereport(DEBUG1,

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Recovery bug

Reply via email to