Hi,

While analyzing some issues in another thread[1], I found a bug where fast-forward decoding can lead to premature advancement of catalog_xmin, resulting in still-required catalog data being removed by vacuum.
The issue arises because we do not build a base snapshot when decoding changes in fast-forward mode, so there is no way to refer to the minimum transaction ID that is still visible in a snapshot when determining the candidate catalog_xmin. As a result, catalog_xmin is advanced directly to the oldest running transaction ID found in the latest running_xacts record.

At the code level, SnapBuildProcessChange -> ReorderBufferSetBaseSnapshot cannot be reached during fast-forward decoding, leaving rb->txns_by_base_snapshot_lsn NULL. When advancing catalog_xmin, the system consults rb->txns_by_base_snapshot_lsn via ReorderBufferGetOldestXmin(), and since it is NULL, it falls back to using running->oldestRunningXid directly as the candidate catalog_xmin. For reference, see the details in SnapBuildProcessRunningXacts.

See the attached test (0002), which demonstrates that catalog data which is still required would be removed after premature catalog_xmin advancement during fast-forward decoding.

To fix this, I think we can allow the base snapshot to be built during fast-forward decoding, as implemented in patch 0001. (We already build a base snapshot in fast-forward mode for logical messages in logicalmsg_decode().)

I also explored the possibility of optimizing the fix further to reduce the overhead of building a snapshot in fast-forward mode, e.g., by maintaining a single base_snapshot_xmin rather than generating a full snapshot for each transaction, and using this base_snapshot_xmin when determining the candidate catalog_xmin. However, I am not confident about maintaining fast-forward-dedicated fields in common structures and possibly handling catalog_xmin differently in that mode. Moreover, I ran a basic test[2] to measure the patch's impact: advancing the slot takes roughly 4% more time after applying the patch, which appears to be acceptable.
Additionally, the cost of building the snapshot via SnapBuildBuildSnapshot() did not show up in the profile, so I think it's reasonable to refrain from further optimization at this stage.

[1] https://www.postgresql.org/message-id/OS3PR01MB5718ED3F0123737C7380BBAD94BB2%40OS3PR01MB5718.jpnprd01.prod.outlook.com

[2] Test steps:
Create a table: CREATE TABLE test(a int);
Create a slot A.
pgbench postgres -T 60 -j 100 -c 100 -f simple-insert

simple-insert:
BEGIN;
INSERT INTO test VALUES(1);
COMMIT;

Use pg_replication_slot_advance to advance slot A to the latest position and measure the time.

HEAD:  Time: 4189.257 ms
PATCH: Time: 4371.358 ms

Best Regards,
Hou zj
v1-0002-Add-a-test.patch
v1-0001-Fix-premature-xmin-advancement-during-fast-forwar.patch