On Tue, Apr 09, 2024 at 10:25:36AM +0000, Hayato Kuroda (Fujitsu) wrote:
>> On Fri, Apr 05, 2024 at 06:22:58AM +0000, Hayato Kuroda (Fujitsu) wrote:
>> I'm slightly annoyed by the fact that there is no check on
>> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
>> show the symmetry between the insert and replay paths.  Shouldn't
>> there be at least an assert for that in the branch when there are no
>> tuples (imagine an else to cover xldata->ntups == 0)?  I mean between
>> just before updating the hash page's opaque area when
>> is_prev_bucket_same_wrt.
> 
> Indeed, added an Assert() in else part. Was it same as your expectation?

Yep, close enough.  Thanks to the insert path we know that there will
be no tuples if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt),
and the replay path where the assertion is added.

>> I've been thinking about ways to make such cases detectable in an
>> automated fashion.  The best choice would be
>> verifyBackupPageConsistency(), just after RestoreBlockImage() on the
>> copy of the block image stored in WAL before applying the page masking
>> which would mask the LSN.  A problem, unfortunately, is that we would
>> need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
>> flag so we would be able to track if the block rebuilt from the record
>> has the *same* LSN as the copy used for the consistency check.  So
>> this edge consistency case would come at a cost, I am afraid, and the
>> 8 bits of bimg_info are precious :/
> 
> I could not decide that the change has more benefit than its cost, so I did
> nothing for it.

It does not prevent doing an implementation that can be used for some
test validation in the context of this thread.  Using this idea, I
have done the attached 0002.  This is not something to merge into the
tree, of course, just a toy to:
- Validate the fix for the problem we know.
- More braodly, check if there are other areas covered by the main
regression test suite if there are other problems.

And from what I can see, applying 0002 without 0001 causes the
following test to fail as the standby chokes on a inconsistent LSN
with a FATAL because the LSN of the apply page coming from the primary
and the LSN saved in the page replayed don't match.  Here is a command
to reproduce the problem:
cd src/test/recovery/ && \
  PG_TEST_EXTRA=wal_consistency_checking \
  PROVE_TESTS=t/027_stream_regress.pl make check

And then in the logs you'd see stuff like:
2024-04-10 16:52:21.844 JST [2437] FATAL:  inconsistent page LSN
replayed 0/A7E5CD18 primary 0/A7E56348
2024-04-10 16:52:21.844 JST [2437] CONTEXT:  WAL redo at 0/A7E59F68
for Hash/SQUEEZE_PAGE: prevblkno 28, nextblkno 4294967295, ntups 0,
is_primary T; blkref #1: rel 1663/16384/25434, blk 7 FPW; blkref #2:
rel 1663/16384/25434, blk 29 FPW; blkref #3: rel 1663/16384/25434, blk
28 FPW; blkref #5: rel 1663/16384/25434, blk 9 FPW; blkref #6: rel
1663/16384/25434, blk 0 FPW

I don't see other areas with a similar problem, at the extent of the
core regression tests, that is to say.  That's my way to say that your
patch looks good to me and that I'm planning to apply it to fix the
issue.

This shows me a more interesting issue unrelated to this thread:
027_stream_regress.pl would be stuck if the standby finds an
inconsistent page under wal_consistency_checking.  This needs to be
fixed before I'm able to create a buildfarm animal with this setup.
I'll spawn a new thread about that tomorrow.
--
Michael
From 4c6597f2ce57439696aecdbda08b374857ab715d Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Wed, 10 Apr 2024 16:47:29 +0900
Subject: [PATCH v3 1/2] Fix inconsistency with replay of hash squeeze record
 for clean buffers

Blah.
---
 src/backend/access/hash/hash_xlog.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index cb1a63cfee..d7ba74579b 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -666,6 +666,7 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		char	   *data;
 		Size		datalen;
 		uint16		ninserted = 0;
+		bool		mod_wbuf = false;
 
 		data = begin = XLogRecGetBlockData(record, 1, &datalen);
 
@@ -695,6 +696,17 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 
 				ninserted++;
 			}
+
+			mod_wbuf = true;
+		}
+		else
+		{
+			/*
+			 * See _hash_freeovflpage() which has a similar assertion when
+			 * there are no tuples.
+			 */
+			Assert(xldata->is_prim_bucket_same_wrt ||
+				   xldata->is_prev_bucket_same_wrt);
 		}
 
 		/*
@@ -711,10 +723,15 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 			HashPageOpaque writeopaque = HashPageGetOpaque(writepage);
 
 			writeopaque->hasho_nextblkno = xldata->nextblkno;
+			mod_wbuf = true;
 		}
 
-		PageSetLSN(writepage, lsn);
-		MarkBufferDirty(writebuf);
+		/* Set LSN and mark writebuf dirty iff it is modified */
+		if (mod_wbuf)
+		{
+			PageSetLSN(writepage, lsn);
+			MarkBufferDirty(writebuf);
+		}
 	}
 
 	/* replay the record for initializing overflow buffer */
-- 
2.43.0

From 22d5d6bbc093a80688ff4e9d4b5865c8fde37d9c Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Wed, 10 Apr 2024 16:48:21 +0900
Subject: [PATCH v3 2/2] Implement consistency check with clean buffers for
 wal_consistency_checking

This extends WAL records to track if a buffer registered in a record was
"clean", aka it is registered without being modified.  This is done by
adding a new flag called BKPIMAGE_CLEAN, recorded when a block uses
REGBUF_NO_CHANGE.

This uses a bit in a record for nothing, so while it is useful for
sanity checks, this should never *ever* be applied as-is.
---
 src/include/access/xlogreader.h           |  3 +++
 src/include/access/xlogrecord.h           |  2 ++
 src/backend/access/rmgrdesc/xlogdesc.c    | 11 +++++++++--
 src/backend/access/transam/xloginsert.c   |  4 ++++
 src/backend/access/transam/xlogreader.c   |  1 +
 src/backend/access/transam/xlogrecovery.c | 15 +++++++++++++++
 6 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 2e9e5f43eb..6e42346c31 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -135,6 +135,7 @@ typedef struct
 	/* Information on full-page image, if any */
 	bool		has_image;		/* has image, even for consistency checking */
 	bool		apply_image;	/* has image that should be restored */
+	bool		clean_image;	/* image cleanly registered  */
 	char	   *bkp_image;
 	uint16		hole_offset;
 	uint16		hole_length;
@@ -424,6 +425,8 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
 	((decoder)->record->blocks[block_id].has_image)
 #define XLogRecBlockImageApply(decoder, block_id)		\
 	((decoder)->record->blocks[block_id].apply_image)
+#define XLogRecBlockImageClean(decoder, block_id)		\
+	((decoder)->record->blocks[block_id].clean_image)
 #define XLogRecHasBlockData(decoder, block_id)		\
 	((decoder)->record->blocks[block_id].has_data)
 
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index b9e5c59fae..c38de8a96e 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -157,6 +157,8 @@ typedef struct XLogRecordBlockImageHeader
 #define BKPIMAGE_HAS_HOLE		0x01	/* page image has "hole" */
 #define BKPIMAGE_APPLY			0x02	/* page image should be restored
 										 * during replay */
+#define BKPIMAGE_CLEAN			0x20	/* clean buffer registered */
+
 /* compression methods supported */
 #define BKPIMAGE_COMPRESS_PGLZ	0x04
 #define BKPIMAGE_COMPRESS_LZ4	0x08
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index e455400716..f045dfe8a0 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -272,8 +272,10 @@ XLogRecGetBlockRefInfo(XLogReaderState *record, bool pretty,
 						method = "unknown";
 
 					appendStringInfo(buf,
-									 " (FPW%s); hole: offset: %u, length: %u, "
+									 " (FPW%s%s); hole: offset: %u, length: %u, "
 									 "compression saved: %u, method: %s",
+									 XLogRecBlockImageClean(record, block_id) ?
+									 "" : " clean",
 									 XLogRecBlockImageApply(record, block_id) ?
 									 "" : " for WAL verification",
 									 XLogRecGetBlock(record, block_id)->hole_offset,
@@ -286,7 +288,9 @@ XLogRecGetBlockRefInfo(XLogReaderState *record, bool pretty,
 				else
 				{
 					appendStringInfo(buf,
-									 " (FPW%s); hole: offset: %u, length: %u",
+									 " (FPW%s%s); hole: offset: %u, length: %u",
+									 XLogRecBlockImageClean(record, block_id) ?
+									 "" : " clean",
 									 XLogRecBlockImageApply(record, block_id) ?
 									 "" : " for WAL verification",
 									 XLogRecGetBlock(record, block_id)->hole_offset,
@@ -329,6 +333,9 @@ XLogRecGetBlockRefInfo(XLogReaderState *record, bool pretty,
 					appendStringInfoString(buf, " FPW");
 				else
 					appendStringInfoString(buf, " FPW for WAL verification");
+
+				if (XLogRecBlockImageClean(record, block_id))
+					appendStringInfoString(buf, " clean");
 			}
 		}
 	}
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 9047601534..227c83c477 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -720,6 +720,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 			if (needs_backup)
 				bimg.bimg_info |= BKPIMAGE_APPLY;
 
+			/* If buffer is clean, register this information */
+			if (regbuf->flags & REGBUF_NO_CHANGE)
+				bimg.bimg_info |= BKPIMAGE_CLEAN;
+
 			if (is_compressed)
 			{
 				/* The current compression is stored in the WAL record */
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 37d2a57961..6737cbe9e3 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1791,6 +1791,7 @@ DecodeXLogRecord(XLogReaderState *state,
 				COPY_HEADER_FIELD(&blk->bimg_info, sizeof(uint8));
 
 				blk->apply_image = ((blk->bimg_info & BKPIMAGE_APPLY) != 0);
+				blk->clean_image = ((blk->bimg_info & BKPIMAGE_CLEAN) != 0);
 
 				if (BKPIMAGE_COMPRESSED(blk->bimg_info))
 				{
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index b2fe2d04cc..98ecb85c0b 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2551,6 +2551,21 @@ verifyBackupPageConsistency(XLogReaderState *record)
 					(errcode(ERRCODE_INTERNAL_ERROR),
 					 errmsg_internal("%s", record->errormsg_buf)));
 
+		/*
+		 * If the replayed block was registered while clean, check that its
+		 * LSN matches with the copy coming from the primary.
+		 */
+		if (XLogRecBlockImageClean(record, block_id))
+		{
+			XLogRecPtr	primary_lsn = PageGetLSN((Page) primary_image_masked);
+			XLogRecPtr	replay_lsn = PageGetLSN((Page) replay_image_masked);
+
+			if (primary_lsn != replay_lsn)
+				elog(FATAL, "inconsistent page LSN replayed %X/%X primary %X/%X",
+					 LSN_FORMAT_ARGS(replay_lsn),
+					 LSN_FORMAT_ARGS(primary_lsn));
+		}
+
 		/*
 		 * If masking function is defined, mask both the primary and replay
 		 * images
-- 
2.43.0

Attachment: signature.asc
Description: PGP signature

Reply via email to