Hi,

Thanks for the feedback.

> Should we have XLogArchiveNotify(), writeTimeLineHistory(), and
> writeTimeLineHistoryFile() enable the directory scan instead?  Else,
> we have to exhaustively cover all such code paths, which may be
> difficult to maintain.  Another reason I am bringing this up is that
> my patch for adjusting .ready file creation [0] introduces more
> opportunities for .ready files to be created out-of-order.

XLogArchiveNotify() notifies Archiver when a log segment is ready for
archival by creating a .ready file. This function is being called for each
log segment and placing a call to enable directory scan here will result
in directory scan for each log segment.

We can have writeTimeLineHistory() and writeTimeLineHistoryFile() to
enable directory scan to handle the scenarios related to timeline switch.

However, in other scenarios, I think we have to explicitly call
PgArchEnableDirScan()
to enable directory scan. PgArchEnableDirScan() takes care of waking up
archiver so that the caller of this function need not have to nudge the
archiver.

> +    /*
> +     * This is a fall-back path, check if we are here due to the
unavailability
> +     * of next anticipated log segment or the archiver is being forced to
> +     * perform a full directory scan. Reset the flag in shared memory
only if
> +     * it has been enabled to force a full directory scan and then
proceed with
> +     * directory scan.
> +     */
> +    if (PgArch->dirScan)
> +        PgArch->dirScan = false;

> Why do we need to check that the flag is set before we reset it?  I
> think we could just always reset it since we are about to do a
> directory scan anyway

Yes, I agree.

> > If there is a race condition here with setting the flag, then an
> > alternative design would be to use a counter - either a plain old
> > uint64 or perhaps pg_atomic_uint64 - and have the startup process
> > increment the counter when it wants to trigger a scan. In this design,
> > the archiver would never modify the counter itself, but just remember
> > the last value that it saw. If it later sees a different value it
> > knows that a full scan is required. I think this kind of system is
> > extremely robust against the general class of problems that you're
> > talking about here, but I'm not sure whether we need it, because I'm
> > not sure whether there is a race with just the bool.

> I'm not sure, either.  Perhaps it would at least be worth adding a
> pg_memory_barrier() after setting dirScan to false to avoid the
> scenario I mentioned (which may or may not be possible).  IMO this
> stuff would be much easier to reason about if we used a lock instead,
> even if the synchronization was not strictly necessary.  However, I
> don't want to hold this patch up too much on this point.

There is one possible scenario where it may run into a race condition. If
archiver has just finished archiving all .ready files and the next
anticipated
log segment is not available then in this case archiver takes the fall-back
path to scan directory. It resets the flag before it begins directory scan.
Now, if a directory scan is enabled by a timeline switch or .ready file
created
out of order in parallel to the event that the archiver resets the flag
then this
might result in a race condition. But in this case also archiver is
eventually
going to perform a directory scan and the desired file will be archived as
part
of directory scan. Apart of this I can't think of any other scenario which
may
result into a race condition unless I am missing something.

I have incorporated the suggestions and updated a new patch. PFA patch v9.

Thanks,
Dipesh
From 04312fa668a56d9860547a02260d68c339747fa8 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit <dipesh.pan...@enterprisedb.com>
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimised by
maintaining the log segment number of current file which is being
archived and incrementing it by '1' to get the next WAL file.
Archiver can check the availability of next file and in case if the
file is not available then it should fall-back to directory scan to
get the oldest WAL file.

If there is a timeline switch then archiver performs a full directory
scan to make sure that archiving history file takes precedence over
archiving WAL files on older timeline.
---
 src/backend/access/transam/timeline.c    |  10 ++
 src/backend/access/transam/xlogarchive.c |  11 +++
 src/backend/postmaster/pgarch.c          | 162 ++++++++++++++++++++++++++++---
 src/include/postmaster/pgarch.h          |   1 +
 4 files changed, 171 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c..f70cede 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -453,6 +453,9 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	{
 		TLHistoryFileName(histfname, newTLI);
 		XLogArchiveNotify(histfname);
+
+		/* Notify archiver to enable directory scan */
+		PgArchEnableDirScan();
 	}
 }
 
@@ -525,6 +528,13 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	 * overwriting an existing file (there shouldn't be one).
 	 */
 	durable_rename_excl(tmppath, path, ERROR);
+
+	/*
+	 * Notify archiver to enable directory scan to archive history file
+	 * immediately.
+	 */
+	if (XLogArchivingActive())
+		PgArchEnableDirScan();
 }
 
 /*
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e..d03c074 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -609,6 +609,17 @@ XLogArchiveCheckDone(const char *xlog)
 
 	/* Retry creation of the .ready file */
 	XLogArchiveNotify(xlog);
+
+	/*
+	 * This .ready file is created out of order, notify archiver to perform
+	 * a full directory scan to archive corresponding WAL file.
+	 */
+	StatusFilePath(archiveStatusPath, xlog, ".ready");
+	if (stat(archiveStatusPath, &stat_buf) == 0)
+	{
+		PgArchEnableDirScan();
+	}
+
 	return false;
 }
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..0bf6309 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -76,8 +76,25 @@
 typedef struct PgArchData
 {
 	int			pgprocno;		/* pgprocno of archiver process */
+
+	/*
+	 * Flag to enable/disable directory scan. If this flag is set then it
+	 * forces archiver to perform a full directory scan to get the next log
+	 * segment. It is not required to synchronize this flag as it guarantees
+	 * directory scan for the next cycle even if it is being missed in current
+	 * cycle.
+	 */
+	bool		dirScan;
 } PgArchData;
 
+/*
+ * Segment number and timeline ID to identify the next file in a WAL sequence
+ */
+typedef struct readyXLogState
+{
+	XLogSegNo	lastSegNo;
+	TimeLineID	lastTLI;
+} readyXLogState;
 
 /* ----------
  * Local data
@@ -97,9 +114,9 @@ static volatile sig_atomic_t ready_to_stop = false;
  */
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
-static void pgarch_ArchiverCopyLoop(void);
+static void pgarch_ArchiverCopyLoop(readyXLogState *xlogState);
 static bool pgarch_archiveXlog(char *xlog);
-static bool pgarch_readyXlog(char *xlog);
+static bool pgarch_readyXlog(char *xlog, readyXLogState *xlogState);
 static void pgarch_archiveDone(char *xlog);
 static void pgarch_die(int code, Datum arg);
 static void HandlePgArchInterrupts(void);
@@ -221,6 +238,18 @@ PgArchWakeup(void)
 		SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);
 }
 
+/*
+ * Set dirScan flag in shared memory. Backend notifies archiver in case if an
+ * action requires full directory scan to get the next log segment.
+ */
+void
+PgArchEnableDirScan(void)
+{
+	PgArch->dirScan = true;
+
+	/* Wake up archiver */
+	PgArchWakeup();
+}
 
 /* SIGUSR2 signal handler for archiver process */
 static void
@@ -243,10 +272,21 @@ pgarch_waken_stop(SIGNAL_ARGS)
 static void
 pgarch_MainLoop(void)
 {
+	readyXLogState xlogState;
 	pg_time_t	last_copy_time = 0;
 	bool		time_to_stop;
 
 	/*
+	 * Initialize xlogState, segment number and TLI will be reset/updated in
+	 * function pgarch_readyXlog() for each cycle.
+	 */
+	xlogState.lastSegNo = 0;
+	xlogState.lastTLI = 0;
+
+	/* First cycle after startup */
+	PgArchEnableDirScan();
+
+	/*
 	 * There shouldn't be anything for the archiver to do except to wait for a
 	 * signal ... however, the archiver exists to protect our data, so she
 	 * wakes up occasionally to allow herself to be proactive.
@@ -280,7 +320,7 @@ pgarch_MainLoop(void)
 		}
 
 		/* Do what we're here for */
-		pgarch_ArchiverCopyLoop();
+		pgarch_ArchiverCopyLoop(&xlogState);
 		last_copy_time = time(NULL);
 
 		/*
@@ -321,7 +361,7 @@ pgarch_MainLoop(void)
  * Archives all outstanding xlogs then returns
  */
 static void
-pgarch_ArchiverCopyLoop(void)
+pgarch_ArchiverCopyLoop(readyXLogState *xlogState)
 {
 	char		xlog[MAX_XFN_CHARS + 1];
 
@@ -331,7 +371,7 @@ pgarch_ArchiverCopyLoop(void)
 	 * some backend will add files onto the list of those that need archiving
 	 * while we are still copying earlier archives
 	 */
-	while (pgarch_readyXlog(xlog))
+	while (pgarch_readyXlog(xlog, xlogState))
 	{
 		int			failures = 0;
 		int			failures_orphan = 0;
@@ -596,29 +636,104 @@ pgarch_archiveXlog(char *xlog)
  * larger ID; the net result being that past timelines are given higher
  * priority for archiving.  This seems okay, or at least not obviously worth
  * changing.
+ *
+ * WAL files are generated in a specific order of log segment number. The
+ * directory scan for each WAL file can be minimised by identifying the next
+ * WAL file in the sequence. This can be achieved by maintaining log segment
+ * number and timeline ID corresponding to WAL file currently being archived.
+ * The log segment number of current WAL file can be incremented by '1' to
+ * point to the next WAL file in a sequence. Full directory scan can be avoided
+ * by checking the availability of next WAL file. "xlogState" specifies the
+ * segment number and timeline ID corresponding to the next WAL file.
+ *
+ * However, a full directory scan is performed in some special cases where it
+ * requires us to archive files which takes precedence over the next anticipated
+ * log segment. For example, history file takes precedence over archiving WAL
+ * files on older timeline or an older WAL file which is being left out because
+ * corresponding .ready file is created out of order.
+ *
+ * Returns "true" if a segment is ready for archival, "xlog" represents the
+ * name of the segment.
  */
 static bool
-pgarch_readyXlog(char *xlog)
+pgarch_readyXlog(char *xlog, readyXLogState *xlogState)
 {
-	/*
-	 * open xlog status directory and read through list of xlogs that have the
-	 * .ready suffix, looking for earliest file. It is possible to optimise
-	 * this code, though only a single file is expected on the vast majority
-	 * of calls, so....
-	 */
+	char		basename[MAX_XFN_CHARS + 1];
+	char		xlogready[MAXPGPATH];
 	char		XLogArchiveStatusDir[MAXPGPATH];
 	DIR		   *rldir;
 	struct dirent *rlde;
+	struct stat	st;
 	bool		found = false;
 	bool		historyFound = false;
 
+	/*
+	 * Skip directory scan until it is not indicated by shared memory flag
+	 * dirScan.
+	 */
+	if (!PgArch->dirScan)
+	{
+		/*
+		 * We already have the next anticipated log segment and timeline, check
+		 * if this WAL is ready to be archived.
+		 */
+		XLogFileName(basename, xlogState->lastTLI, xlogState->lastSegNo, wal_segment_size);
+		StatusFilePath(xlogready, basename, ".ready");
+
+		if (stat(xlogready, &st) == 0)
+		{
+			strcpy(xlog, basename);
+
+			/*
+			 * Increment the readyXLogState's lastSegNo to point to the next
+			 * WAL file. Although we have not yet archived the current WAL file
+			 * and readyXLogState points to the next WAL file, this is safe
+			 * because the next cycle will not begin until we finish archiving
+			 * current WAL file.
+			 */
+			xlogState->lastSegNo++;
+			return true;
+		}
+	}
+
+	/*
+	 * This is a fall-back path which indicates that either the next
+	 * anticipated log segment in unavailable or archiver is being forced to
+	 * perform a full directory scan. In any case reset the flag in shared
+	 * memory and then proceed with directory scan.
+	 */
+	PgArch->dirScan = false;
+
+	/*
+	 * Make sure that the flag reset is flushed to memory before it is examined
+	 * or set again.
+	 */
+	pg_memory_barrier();
+
+	/*
+	 * Perform a full directory scan to identify the next log segment. There
+	 * may be one of the following scenarios which may require us to perform a
+	 * full directory scan.
+	 *
+	 * - This is the first cycle since archiver has started and there is no
+	 *   idea about the next anticipated log segment.
+	 *
+	 * - There is a timeline switch, archive history file as part of this
+	 *   timeline switch.
+	 *
+	 * - .ready file is created out of order.
+	 *
+	 * - The next anticipated log segment is not available.
+	 *
+	 * open xlog status directory and read through list of xlogs that have the
+	 * .ready suffix, looking for earliest file.
+	 */
 	snprintf(XLogArchiveStatusDir, MAXPGPATH, XLOGDIR "/archive_status");
 	rldir = AllocateDir(XLogArchiveStatusDir);
 
 	while ((rlde = ReadDir(rldir, XLogArchiveStatusDir)) != NULL)
 	{
 		int			basenamelen = (int) strlen(rlde->d_name) - 6;
-		char		basename[MAX_XFN_CHARS + 1];
 		bool		ishistory;
 
 		/* Ignore entries with unexpected number of characters */
@@ -661,6 +776,27 @@ pgarch_readyXlog(char *xlog)
 				strcpy(xlog, basename);
 		}
 	}
+
+	if (found)
+	{
+		if (!historyFound)
+		{
+			/*
+			 * Reset segment number and timeline ID as this is the beginning of a
+			 * new sequence.
+			 */
+			XLogFromFileName(xlog, &xlogState->lastTLI, &xlogState->lastSegNo,
+					wal_segment_size);
+
+			/* Increment log segment number to point to the next WAL file */
+			xlogState->lastSegNo++;
+		}
+
+		ereport(LOG,
+				(errmsg("directory scan to archive write-ahead log file \"%s\"",
+						xlog)));
+	}
+
 	FreeDir(rldir);
 
 	return found;
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 1e47a14..265f7e5 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -31,5 +31,6 @@ extern void PgArchShmemInit(void);
 extern bool PgArchCanRestart(void);
 extern void PgArchiverMain(void) pg_attribute_noreturn();
 extern void PgArchWakeup(void);
+extern void PgArchEnableDirScan(void);
 
 #endif							/* _PGARCH_H */
-- 
1.8.3.1

Reply via email to