The following patch has been independently verified as fixing bug 4137. The problem was that at the very start of archive recovery the %r parameter in restore_command could be set to a filename later than the currently requested filename (%f). This could lead to early truncation of the archived WAL files and would cause warm standby replication to fail soon afterwards, in certain specific circumstances.
Fix applied to both core server in generating correct %r filenames and also to pg_standby to prevent acceptance of out-of-sequence filenames. Request review and commit. No port specific details. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Index: contrib/pg_standby/pg_standby.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/contrib/pg_standby/pg_standby.c,v retrieving revision 1.10 diff -c -r1.10 pg_standby.c *** contrib/pg_standby/pg_standby.c 15 Nov 2007 21:14:30 -0000 1.10 --- contrib/pg_standby/pg_standby.c 3 May 2008 11:27:12 -0000 *************** *** 297,302 **** --- 297,311 ---- if (restartWALFileName) { + /* + * Don't do cleanup if the restartWALFileName provided + * is later than the xlog file requested. This is an error + * and we must not remove these files from archive. + * This shouldn't happen, but better safe than sorry. + */ + if (strcmp(restartWALFileName, nextWALFileName) > 0) + return false; + strcpy(exclusiveCleanupFileName, restartWALFileName); return true; } *************** *** 584,590 **** fprintf(stderr, "\nMax wait interval : %d %s", maxwaittime, (maxwaittime > 0 ? "seconds" : "forever")); fprintf(stderr, "\nCommand for restore : %s", restoreCommand); ! fprintf(stderr, "\nKeep archive history : %s and later", exclusiveCleanupFileName); fflush(stderr); } --- 593,603 ---- fprintf(stderr, "\nMax wait interval : %d %s", maxwaittime, (maxwaittime > 0 ? "seconds" : "forever")); fprintf(stderr, "\nCommand for restore : %s", restoreCommand); ! fprintf(stderr, "\nKeep archive history : "); ! if (need_cleanup) ! fprintf(stderr, "%s and later", exclusiveCleanupFileName); ! else ! fprintf(stderr, "No cleanup required"); fflush(stderr); } Index: src/backend/access/transam/xlog.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlog.c,v retrieving revision 1.300 diff -c -r1.300 xlog.c *** src/backend/access/transam/xlog.c 24 Apr 2008 14:23:43 -0000 1.300 --- src/backend/access/transam/xlog.c 3 May 2008 10:58:08 -0000 *************** *** 2484,2489 **** --- 2484,2509 ---- } /* + * Calculate the archive file cutoff point. All files + * before this file can be deleted from the archive. + * The filename is not inclusive, so you must not remove + * this filename from the archive; it will definitely be + * required. Most of the time this will be the last + * restartpoint, but at the start of archive recovery it + * will be the file containing the checkpoint written by + * pg_start_backup(). We do the calculation now so that + * %r is always less than or equal to %f before we start + * to construct the command to be executed + */ + XLByteToSeg(ControlFile->checkPointCopy.redo, + restartLog, restartSeg); + XLogFileName(lastRestartPointFname, + ControlFile->checkPointCopy.ThisTimeLineID, + restartLog, restartSeg); + if (strcmp(lastRestartPointFname, xlogfname) > 0) + strcpy(lastRestartPointFname, xlogfname); + + /* * construct the command to be executed */ dp = xlogRestoreCmd; *************** *** 2512,2522 **** case 'r': /* %r: filename of last restartpoint */ sp++; - XLByteToSeg(ControlFile->checkPointCopy.redo, - restartLog, restartSeg); - XLogFileName(lastRestartPointFname, - ControlFile->checkPointCopy.ThisTimeLineID, - restartLog, restartSeg); StrNCpy(dp, lastRestartPointFname, endp - dp); dp += strlen(dp); break; --- 2532,2537 ----
-- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches