From b3ad993bd6f8758f0a91354b8448e01647207bf3 Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Date: Tue, 16 Dec 2025 10:48:12 +0100
Subject: Fix 'unexpected data beyond EOF' on replica restart

On restart, a replica can fail with an 'unexpected data beyond EOF in
block 200 of relation T/D/R' error. This can happen under the following
circumstances:

- A relation has a size of 400 blocks.
  - Blocks 201 to 400 are empty.
  - Block 200 has two rows.
  - Blocks 101 to 199 are empty.
- A restartpoint is done.
- Vacuum truncates the relation to 200 blocks.
- A WAL record with an FPW deletes a row in block 200.
- A checkpoint is done.
- A WAL record with an FPW deletes the last row in block 200.
- Vacuum truncates the relation to 100 blocks.
- The replica restarts.

When the replica restarts:
- The relation on disk is already 100 blocks long because the second
  truncate was applied before the restart.
- The first truncate to 200 blocks is replayed. It is silently ignored
  because the file is already smaller, but it still updates the cached
  size to 200 blocks.
- The first FPW on block 200 is applied. XLogReadBufferForRedo relies
  on the cached size, incorrectly assumes the page exists in the file,
  and thus doesn't extend the relation.
- The online checkpoint record is replayed, calling smgrdestroyall,
  which discards the cached size.
- The second FPW on block 200 is applied. This time the detected size
  is 100 blocks, so an extend is attempted. However, block 200 is
  already present in the buffer table due to the first FPW. This
  triggers the 'unexpected data beyond EOF' error since the page isn't
  new.

This patch fixes the issue by only updating smgr_cached_nblocks when
the truncate target is smaller than the current size. When the target
is larger, the file isn't modified and we keep the old cached value.
---
 src/backend/storage/smgr/md.c   |  3 +++
 src/backend/storage/smgr/smgr.c | 12 +++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)
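
The replay sequence described in the commit message can be sketched as a
small toy model. This is only an illustration under stated assumptions,
not PostgreSQL code: ToyRel, TOY_SIZE_UNKNOWN and every toy_* function are
hypothetical names, block numbers follow the message's numbering (a
200-block relation ends at block 200), and the buffer table is reduced to
a single remembered block number.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SIZE_UNKNOWN (-1)	/* cached size was discarded */

/* Hypothetical stand-in for one relation fork; not PostgreSQL's SMgrRelation. */
typedef struct ToyRel
{
	int		disk_nblocks;	/* blocks actually present in the file */
	int		cached_nblocks;	/* toy analogue of smgr_cached_nblocks */
	int		buffered_blkno;	/* block held in the buffer table, 0 if none */
} ToyRel;

/* Return the cached size, re-reading it from "disk" if it was discarded. */
static int
toy_nblocks(ToyRel *rel)
{
	if (rel->cached_nblocks == TOY_SIZE_UNKNOWN)
		rel->cached_nblocks = rel->disk_nblocks;
	return rel->cached_nblocks;
}

/* Replay a truncate record; clamp selects the patched cache update. */
static void
toy_replay_truncate(ToyRel *rel, int nblocks, bool clamp)
{
	int		old_nblocks = toy_nblocks(rel);

	/* A request beyond the current EOF leaves the file untouched. */
	if (nblocks < rel->disk_nblocks)
		rel->disk_nblocks = nblocks;

	if (clamp && nblocks > old_nblocks)
		rel->cached_nblocks = old_nblocks;	/* patched: keep the old size */
	else
		rel->cached_nblocks = nblocks;		/* unpatched: trust the record */
}

/* Replay a full-page image; returns false for "unexpected data beyond EOF". */
static bool
toy_replay_fpw(ToyRel *rel, int blkno)
{
	if (blkno > toy_nblocks(rel))
	{
		/* Extending expects a new page, not one that is already buffered. */
		if (rel->buffered_blkno == blkno)
			return false;
		rel->disk_nblocks = blkno;
		rel->cached_nblocks = blkno;
	}
	rel->buffered_blkno = blkno;
	return true;
}

/* Run the restart sequence from the commit message; true if replay succeeds. */
static bool
toy_replay_scenario(bool clamp)
{
	/* Disk already reflects the second truncate; nothing is cached yet. */
	ToyRel	rel = {100, TOY_SIZE_UNKNOWN, 0};

	toy_replay_truncate(&rel, 200, clamp);	/* first truncate */
	if (!toy_replay_fpw(&rel, 200))			/* first FPW */
		return false;
	rel.cached_nblocks = TOY_SIZE_UNKNOWN;	/* checkpoint drops the cache */
	return toy_replay_fpw(&rel, 200);		/* second FPW */
}
```

In this model, toy_replay_scenario(false) hits the error path on the
second FPW, while toy_replay_scenario(true) extends the relation on the
first FPW and replays both records cleanly.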

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2ccb0faceb5..c2c7c66d42b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1272,6 +1272,9 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
  * functions for this relation or handled interrupts in between.  This makes
  * sure we have opened all active segments, so that truncate loop will get
  * them all!
+ *
+ * If nblocks > curnblk, the request is ignored when InRecovery is set;
+ * otherwise, an error is raised.
  */
 void
 mdtruncate(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index bce37a36d51..ee2e25a35c8 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -870,6 +870,9 @@ smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
  * be called in a critical section, but the current size must be checked
  * outside the critical section, and no interrupts or smgr functions relating
  * to this relation should be called in between.
+ *
+ * If the specified number of blocks is higher than the current size, the
+ * request is ignored when InRecovery is set; otherwise, an error is raised.
  */
 void
 smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
@@ -910,8 +913,15 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
 		 * backends to invalidate their copies of smgr_cached_nblocks, and
 		 * these ones too at the next command boundary. But ensure they aren't
 		 * outright wrong until then.
+		 *
+		 * nblocks > old_nblocks can happen when a relation is truncated
+		 * multiple times and the restartpoint is located before the
+		 * truncates. The relation on disk will have the size of the second
+		 * truncate, so when replaying the first truncate we will have
+		 * nblocks > old_nblocks. We must keep old_nblocks when this happens.
 		 */
-		reln->smgr_cached_nblocks[forknum[i]] = nblocks[i];
+		reln->smgr_cached_nblocks[forknum[i]] =
+			nblocks[i] > old_nblocks[i] ? old_nblocks[i] : nblocks[i];
 	}
 }
 
-- 
2.51.0

