Re: Improve WALRead() to suck data directly from WAL buffers when possible

Bharath Rupireddy Tue, 14 Mar 2023 00:59:18 -0700

On Tue, Mar 7, 2023 at 11:14 PM Nathan Bossart <nathandboss...@gmail.com> wrote:
>
> On Tue, Mar 07, 2023 at 12:39:13PM +0530, Bharath Rupireddy wrote:
> > On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart <nathandboss...@gmail.com> 
> > wrote:
> >> Is it possible to memcpy more than a page at a time?
> >
> > It would complicate things a lot there; the logic to figure out the
> > last page bytes that may or may not fit in the whole page gets
> > complicated. Also, the logic to verify each page's header gets
> > complicated. We might lose out if we memcpy all the pages at once and
> > start verifying each page's header in another loop.
>
> Doesn't the complicated logic you describe already exist to some extent in
> the patch?  You are copying a page at a time, which involves calculating
> various addresses and byte counts.


Okay here I am with the v10 patch set attached that avoids multiple
memcpy calls which must benefit the callers who want to read more than
1 WAL buffer page (streaming replication WAL sender for instance).

> >> +    elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for 
> >> given LSN %X/%X, Timeline ID %u",
> >> +         *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
> >>
> >> I definitely don't think we should put an elog() in this code path.
> >> Perhaps this should be guarded behind WAL_DEBUG.
> >
> > Placing it behind WAL_DEBUG doesn't help users/developers. My
> > intention was to let users know that the WAL read hit the buffers,
> > it'll help them report if any issue occurs and also help developers to
> > debug that issue.
>
> I still think an elog() is mighty expensive for this code path, even when
> it doesn't actually produce any messages.  And when it does, I think it has
> the potential to be incredibly noisy.

Well, my motive was to have a way for the user to know WAL buffer hits
and misses to report any found issues. However, I have a plan later to
add WAL buffer stats (hits/misses). I understand that even if someone
enables DEBUG1, this message can bloat server log files and make
recovery slower, especially on a standby. Hence, I agree to keep these
logs behind the WAL_DEBUG macro like others and did so in the attached
v10 patch set.

Please review the attached v10 patch set further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From aa6454d9abb9a70b728dbba7f40279108486a3e4 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 14 Mar 2023 07:30:09 +0000
Subject: [PATCH v10] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c       | 171 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  42 +++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 217 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 543d4d897a..d40b9562e1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1639,6 +1639,177 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state PG_USED_FOR_ASSERTS_ONLY,
+					XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf)
+{
+	XLogRecPtr	ptr = startptr;
+	Size	nbytes = count;	/* total bytes requested to be read by caller */
+	Size	ntotal = 0;	/* total bytes read */
+	Size	nbatch = 0;	/* bytes to be read in single batch */
+	char	*batchstart = NULL;	/* location to read from for single batch */
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+	Assert(tli == GetWALInsertionTimeLine());
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that the
+	 * concurrent WAL readers are also allowed. We try to do as less work as
+	 * possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return ntotal;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int 	idx;
+		char	*page;
+		char    *data;
+		XLogPageHeader	phdr;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = XLogCtl->xlblocks[idx];
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+		 * need to ensure that we are not reading a page that just got
+		 * initialized. For this, we looka at the needed page header.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		/* Return, if WAL buffer page doesn't look valid. */
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;
+
+		/*
+		 * Note that we don't perform all page header checks here to avoid
+		 * extra work in production builds, callers will anyway do those checks
+		 * extensively. However, in an assert-enabled build, we perform all the
+		 * checks here and raise an error if failed.
+		 */
+#ifdef USE_ASSERT_CHECKING
+		if (state != NULL &&
+			!XLogReaderValidatePageHeader(state, (endptr - XLOG_BLCKSZ),
+										  (char *) phdr))
+			ereport(ERROR,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg_internal("error while reading WAL from WAL buffers: %s", state->errormsg_buf)));
+#endif
+
+		/* Count what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			nbatch += nbytes;
+			ntotal += nbytes;
+			nbytes = 0;
+		}
+		else
+		{
+			Size	navailable;
+
+			/*
+			 * All the bytes are not in one page. Deduce available bytes on the
+			 * current page, count them and continue to look for remaining
+			 * bytes.
+			 */
+			navailable = XLOG_BLCKSZ - (data - page);
+			Assert(navailable > 0 && navailable <= nbytes);
+			ptr += navailable;
+			nbytes -= navailable;
+			nbatch += navailable;
+			ntotal += navailable;
+		}
+
+		/*
+		 * We avoid multiple memcpy calls while reading WAL. Note that we
+		 * memcpy what we have counted so far whenever we are wrapping around
+		 * WAL buffers (because WAL buffers are organized as cirucular array of
+		 * pages) and continue to look for remaining WAL.
+		 */
+		if (batchstart == NULL)
+		{
+			/* Mark where the data in WAL buffers starts from. */
+			batchstart = data;
+		}
+
+		/*
+		 * We are wrapping around WAL buffers, so read what we have counted so
+		 * far.
+		 */
+		if (idx == XLogCtl->XLogCacheBlck)
+		{
+			Assert(batchstart != NULL);
+			Assert(nbatch > 0);
+
+			memcpy(buf, batchstart, nbatch);
+			buf += nbatch;
+
+			/* Reset for next batch. */
+			batchstart = NULL;
+			nbatch = 0;
+		}
+	}
+
+	/* Read what we have counted so far. */
+	Assert(nbatch <= ntotal);
+	if (batchstart != NULL && nbatch > 0)
+		memcpy(buf, batchstart, nbatch);
+
+	LWLockRelease(WALBufMappingLock);
+
+	/* We never read more than what the caller has asked for. */
+	Assert(ntotal <= count);
+
+#ifdef WAL_DEBUG
+	if (XLOG_DEBUG)
+		ereport(DEBUG1,
+				(errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+								  ntotal, count, LSN_FORMAT_ARGS(startptr), tli)));
+#endif
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cadea21b37..03f0cca1e6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1486,8 +1486,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1498,6 +1497,45 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend tools have no idea of WAL buffers. */
+	Size        nread;
+
+	/*
+	 * Try reading WAL from WAL buffers. We skip this step and continue the
+	 * usual way, that is to read from WAL file, either when server is in
+	 * recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		nread = XLogReadFromBuffers(state, startptr, tli, count, buf);
+
+		Assert(nread >= 0);
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially or
+		 * nothing, then continue to read the remaining bytes the usual way,
+		 * that is, read from WAL file.
+		 */
+		if (count == nread)
+			return true;	/* Buffer hit, so return. */
+		else if (count > nread)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += nread;
+			startptr += nread;
+			count -= nread;
+		}
+	}
+#endif	/* FRONTEND */
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index cfe5409738..4fdd8c8b17 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(struct XLogReaderState *state,
+								XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

From 9a7abe263123266cfc7fccd259570164ecdab047 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 14 Mar 2023 07:14:01 +0000
Subject: [PATCH v10] Add test module for verifying WAL read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 +++++++++
 .../test_wal_read_from_buffers/meson.build    | 36 +++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 44 ++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 ++++++
 .../test_wal_read_from_buffers.c              | 51 +++++++++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 180 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..ea33361f69 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -33,6 +33,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..e3ffd3538d 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -29,5 +29,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7a09533ec7
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40a36edc07
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,36 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+# FIXME: prevent install during main install, but not during test :/
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_wal_read_from_buffers
+
+install_data(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..3448e0bed6
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,44 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers don't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers');
+
+# Get current insert LSN. After this, we generate some WAL which is guranteed
+# to be in WAL buffers as there is no other WAL generating activity is
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn')});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..8e89910133
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    OUT read_from_buffers bool
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT PARALLEL UNSAFE;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..d0942658da
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,51 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test code for veryfing WAL read from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn;
+	Size	nread;
+	TimeLineID	tli;
+	char	data[XLOG_BLCKSZ] = {0};
+
+	lsn = PG_GETARG_LSN(0);
+
+	if (XLogRecPtrIsInvalid(lsn))
+		PG_RETURN_BOOL(false);
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
+	tli = GetWALInsertionTimeLine();
+
+	nread = XLogReadFromBuffers(NULL, lsn, tli, XLOG_BLCKSZ, data);
+
+	PG_RETURN_LSN(nread > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..7852b3e331
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test code for veryfing WAL read from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

Re: Improve WALRead() to suck data directly from WAL buffers when possible

Reply via email to