Two issues leading to discrepancies in FSM data on the standby server

Alexey Makhmutov Thu, 19 Mar 2026 18:32:43 -0700

We’ve recently observed a significant increase in response time for insert operations after switching over to a replica server. The collected information pointed to a discrepancy in the FSM data on the replica side, which became visible to the insert sessions once the autovacuum process pulled incorrect data from leaf blocks into the FSM root. The situation looked like the case discussed in https://postgr.es/m/[email protected] and which was supposed to be fixed by ‘ab7dbd681’ (which introduced an FSM update during the 'heap_xlog_visible' invocation). However, in our case and in synthetic tests we were able to see data blocks marked as ‘all visible’ but still having incorrect FSM records.

After analyzing the code I’ve noticed that during recovery FSM data is updated in XLogRecordPageWithFreeSpace, which uses MarkBufferDirtyHint to mark the FSM block as modified. However, if data checksums are enabled, this call is a no-op during recovery: it exits immediately without marking the block as dirty. The logic here is that, since no new WAL data can be generated during recovery, changes to hints in a block should not mark the block as dirty, to avoid the risk of torn pages being written. This seems logical, but it is not well aligned with the FSM case, as FSM blocks can simply be zeroed if a checksum mismatch is detected. Currently, changes to an FSM block can be lost if changes to that particular block occur rarely enough to allow its eviction from the cache. To persist a change, the modification needs to be performed while the FSM block is still kept in buffers and is still marked as dirty after receiving its FPI. If the block was already cleaned, the change won’t be persisted and the stored FSM blocks may remain in an obsolete state. In our case the table had its 'fillfactor' parameter set below 80, so during insert bursts each FSM block on the replica side was modified only during the first access to that FSM block since the checkpoint (with FPI), and then by processing the XLOG_HEAP2_VISIBLE record for a data block once it was marked as ‘all visible’. This gives plenty of time to clean the buffer between these two moments, so the second change was simply never written to disk. As a result, a large number of data blocks were left with incorrect entries in FSM leaf blocks, which caused the problem after switchover.
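
The sequence described above can be sketched with a small toy model. This is not PostgreSQL code: the ToyFsmBuffer struct, the toy_* functions and the category values 200 and 10 are all made up for illustration, and the model only assumes what is stated above (a hint-style dirty mark is skipped during recovery with checksums enabled when the buffer is already clean).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one FSM slot with an on-disk copy and a cached copy. */
typedef struct
{
    unsigned char disk_value;    /* category stored on disk */
    unsigned char cached_value;  /* category in the buffer cache */
    bool          dirty;         /* will the next flush write the page? */
} ToyFsmBuffer;

/* Hint-style marking: skipped in recovery when checksums are enabled. */
static void
toy_mark_dirty_hint(ToyFsmBuffer *buf, bool in_recovery, bool checksums)
{
    if (!buf->dirty && in_recovery && checksums)
        return;                  /* no-op: the change stays cache-only */
    buf->dirty = true;
}

/* Flush a dirty page, then drop the cached copy and re-read it from disk. */
static void
toy_flush_and_evict(ToyFsmBuffer *buf)
{
    if (buf->dirty)
        buf->disk_value = buf->cached_value;
    buf->dirty = false;
    buf->cached_value = buf->disk_value;
}

/* Replay the two-step sequence; returns the value that ends up on disk. */
static unsigned char
toy_replay(bool use_hint)
{
    ToyFsmBuffer buf = {0, 0, false};

    /* 1. First touch after a checkpoint: FPI restored, page marked dirty. */
    buf.cached_value = 200;
    buf.dirty = true;
    toy_flush_and_evict(&buf);
    assert(buf.disk_value == 200 && !buf.dirty);

    /* 2. Much later: XLOG_HEAP2_VISIBLE replay lowers the free space. */
    buf.cached_value = 10;
    if (use_hint)
        toy_mark_dirty_hint(&buf, true, true);
    else
        buf.dirty = true;        /* what a plain MarkBufferDirty would do */
    toy_flush_and_evict(&buf);

    return buf.disk_value;
}
```

In this model toy_replay(true) leaves the stale value 200 on disk, while toy_replay(false) persists the updated value 10.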

Given that the FSM is ready to handle torn page writes and XLogRecordPageWithFreeSpace is called only during recovery, there seems to be no reason to use MarkBufferDirtyHint here instead of a regular MarkBufferDirty call. The code is already trying to limit updates to the FSM (e.g. by updating it only after reaching 80% of used space for regular DML), so we probably want to ensure that these updates are actually persisted.
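
A rough sketch of why a fillfactor below 80 hides the regular DML updates, under two assumptions that are not spelled out in this message: the default 8 kB block size, and a redo-side check that updates the FSM only when less than one fifth of the block remains free (the 80% rule mentioned above). The names toy_fillfactor_reserve and toy_redo_updates_fsm are hypothetical, and page header overhead is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLCKSZ 8192          /* assumed default block size */

/* Free space that inserts leave untouched for a given fillfactor. */
static size_t
toy_fillfactor_reserve(int fillfactor)
{
    assert(fillfactor >= 10 && fillfactor <= 100);
    return (size_t) TOY_BLCKSZ * (100 - fillfactor) / 100;
}

/* Assumed redo-side rule: touch the FSM only when the page is >80% used. */
static bool
toy_redo_updates_fsm(size_t freespace)
{
    return freespace < TOY_BLCKSZ / 5;
}
```

With fillfactor 70 the reserve is about 2457 bytes, which stays above the 1638-byte threshold, so in this model replay of regular inserts never reaches the FSM update and only the FPI and XLOG_HEAP2_VISIBLE paths remain.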

The second issue (not related to our observed problem) is in ‘heap_xlog_visible’: this function calls ‘PageGetFreeSpace’ instead of ‘PageGetHeapFreeSpace’ to get the size of free space for regular heap blocks. This seems like a bug, as 'PageGetHeapFreeSpace' is used in every other case where we need to get the free space of a heap page. Using the wrong function could also cause incorrect data to be written to the FSM on the replica: if a block still has free space but has already reached the MaxHeapTuplesPerPage limit, it should be marked as unavailable for new rows in the FSM; otherwise an inserter will need to visit the block and fix its FSM data itself.
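
The difference between the two calls can be illustrated with a simplified model. This is a sketch, not the real bufpage.c code: ToyHeapPage and both toy_* functions are hypothetical, the limit of 291 line pointers assumes 8 kB blocks, and the real functions work on page headers rather than on counters.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_HEAP_TUPLES_PER_PAGE 291   /* assumed value for 8 kB blocks */
#define TOY_ITEM_ID_SIZE 4                 /* size of one line pointer */

typedef struct
{
    size_t free_bytes;       /* gap between line pointers and tuple data */
    int    line_pointers;    /* line pointers currently allocated */
    int    unused_pointers;  /* allocated line pointers that can be reused */
} ToyHeapPage;

/* Generic variant: raw free space minus room for one new line pointer. */
static size_t
toy_page_free_space(const ToyHeapPage *p)
{
    if (p->free_bytes < TOY_ITEM_ID_SIZE)
        return 0;
    return p->free_bytes - TOY_ITEM_ID_SIZE;
}

/* Heap variant: also reports 0 when no line pointer can be obtained. */
static size_t
toy_heap_free_space(const ToyHeapPage *p)
{
    size_t space = toy_page_free_space(p);

    assert(p->unused_pointers <= p->line_pointers);
    if (space > 0 &&
        p->line_pointers >= TOY_MAX_HEAP_TUPLES_PER_PAGE &&
        p->unused_pointers == 0)
        return 0;                /* bytes are free, but no slot is */
    return space;
}
```

For a page with 2000 free bytes and 291 line pointers, none of them reusable, the generic variant reports 1996 bytes while the heap variant reports 0, which is the case where the FSM on the replica would advertise space that no inserter can use.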

Attached are separate patches which try to fix both of these problems: calling ‘MarkBufferDirty’ instead of ‘MarkBufferDirtyHint’ in the first case, and replacing ‘PageGetFreeSpace’ with ‘PageGetHeapFreeSpace’ in the second case.

Two synthetic test cases are also attached which simulate both of these situations: ‘test_case1.zip’ simulates the problem with the lost FSM update on the replica side, and ‘test_case2.zip’ simulates incorrect FSM data on the standby server for blocks with a large number of redirect slots. In both cases the ‘test_prepare.sh’ script can be edited to specify the path to the PG installation and the port numbers. Then invoke the ‘test_prepare.sh’ script to prepare two databases. For the first case, the second script ‘test_run.sh’ needs to be invoked after that to show the large number of blocks being visited for a simple insert. For the second test case, the state of the FSM (for a single block) is simply displayed at the end of ‘test_prepare.sh’.


Thanks,
Alexey

From 3a51a4f3a920bed56910ae38f1c3f12059649c56 Mon Sep 17 00:00:00 2001
From: Alexey Makhmutov <[email protected]>
Date: Mon, 16 Mar 2026 13:32:45 +0300
Subject: [PATCH 1/2] Mark modified FSM buffer as dirty during recovery.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The XLogRecordPageWithFreeSpace function updates freespace map (FSM)
data while replaying data-level WAL records during recovery. If an FSM
block is updated, it needs to be marked as modified, and currently this
is done using a MarkBufferDirtyHint call (as in all other cases of
modifying FSM data). However, in a recovery context this function
actually does nothing if checksums are enabled. It is assumed that a
page should not be dirtied during recovery while modifying hints, to
protect from torn pages, as no new WAL data can be generated at this
point to store an FPI.

Such logic does not seem fully aligned with the FSM case, as its
blocks can simply be zeroed if a checksum mismatch is detected.
Currently, changes to an FSM block can be lost if changes to that
particular block occur rarely enough to allow its eviction from the
cache. To persist a change, the modification needs to be performed
while the FSM block is still kept in buffers and is still marked as
dirty after receiving its FPI. If the block was already cleaned, the
change won’t be persisted and the stored FSM blocks may remain in an
obsolete state.

If a large number of discrepancies between data in leaf FSM blocks and
actual data blocks accumulates on the replica server side, this can
cause significant delays for insert operations after switchover. Such
an insert operation may need to visit many data blocks marked in the
FSM as having enough space, only to discover that this information is
incorrect and the FSM records need to be fixed. In a heavily trafficked
insert-only table with many concurrent clients performing inserts this
has been observed to cause multi-second stalls, causing visible
application malfunction. The desire to avoid such cases was the reason
behind commit ab7dbd681, which introduced an update of FSM data during
the heap_xlog_visible invocation. However, as an update to the FSM data
on the standby side can be lost due to the missing dirty flag, there is
still a possibility of hitting such a situation. Note that having a
zeroed FSM page in such a case (as a result of a checksum mismatch) is
preferable, as a zero value will be interpreted as an indication of
full data blocks and the inserter will simply be routed to the next FSM
block or to the end of the table.

Given that FSM is ready to handle torn page writes and
XLogRecordPageWithFreeSpace is called only during the recovery there
seems to be no reason to use MarkBufferDirtyHint here instead of a
regular MarkBufferDirty call.
---
 src/backend/storage/freespace/freespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index b9a8f368a63..5c5d86bc106 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -232,7 +232,7 @@ XLogRecordPageWithFreeSpace(RelFileLocator rlocator, BlockNumber heapBlk,
 		PageInit(page, BLCKSZ, 0);

 	if (fsm_set_avail(page, slot, new_cat))
-		MarkBufferDirtyHint(buf, false);
+		MarkBufferDirty(buf);
 	UnlockReleaseBuffer(buf);
 }

-- 
2.53.0

From e7ba11349dfa6c0dc250562c13c8db90d53dcdab Mon Sep 17 00:00:00 2001
From: Alexey Makhmutov <[email protected]>
Date: Mon, 16 Mar 2026 13:33:10 +0300
Subject: [PATCH 2/2] Use PageGetHeapFreeSpace in heap_xlog_visible.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Free space in regular heap pages needs to be calculated using
PageGetHeapFreeSpace rather than PageGetFreeSpace. This is required to
take into account the MaxHeapTuplesPerPage limit; otherwise a page may
be marked as having free space while it’s impossible to add any new row
to it.
---
 src/backend/access/heap/heapam_xlog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c
index 1da774c1536..c8f5f4f7988 100644
--- a/src/backend/access/heap/heapam_xlog.c
+++ b/src/backend/access/heap/heapam_xlog.c
@@ -326,7 +326,7 @@ heap_xlog_visible(XLogReaderState *record)
 
 	if (BufferIsValid(buffer))
 	{
-		Size		space = PageGetFreeSpace(BufferGetPage(buffer));
+		Size		space = PageGetHeapFreeSpace(BufferGetPage(buffer));
 
 		UnlockReleaseBuffer(buffer);
 
-- 
2.53.0

<<attachment: test_case1.zip>>

<<attachment: test_case2.zip>>
