On 21/03/2025 12:28, Tomas Vondra wrote:
But it seems it changed in 952365cded6, which is:

     commit 952365cded635e54c4177399c0280cb7a5e34c11
     Author: Heikki Linnakangas <heikki.linnakan...@iki.fi>
     Date:   Mon Dec 23 12:42:39 2024 +0200

     Remove unnecessary GetTransactionSnapshot() calls

     In get_database_list() and get_subscription_list(), the
     GetTransactionSnapshot() call is not required because the catalog
     table scans use the catalog snapshot, which is held until the end of
     the scan. See table_beginscan_catalog(), which calls
     RegisterSnapshot(GetCatalogSnapshot(relid)).

     ...

I'm not claiming the commit is buggy - it might be, but I think it's
more likely it just made the preexisting bug easier to hit. I base this
on the observation that incrementing the xactCompletionCount made the
assert failures go away.

That commit removed three GetTransactionSnapshot() calls: one in autovacuum, one in logical replication launcher, and one in InitPostgres(). It must be the one in InitPostgres() that makes the difference, because the other two are not executed in a standby.

I was able to reproduce this and to catch it in action with 'rr'. After commit 952365cded6, the sequence of events looks like this:

1. At backend startup, GetCatalogSnapshot() sets CatalogSnapshotData.xmin=1000, xactCompletionCount=10 2. On first query, GetTransactionSnapshot() sets CurrentSnapshotData.xmin=1002, xactCompletionCount=10 3. GetCatalogSnapshot() is called again. It tries to reuse the snapshot with xmin=1000 and hits the assertion.

Before commit 952365cded6, which removed the GetTransactionSnapshot() call from InitPostgres, this usually happens instead:

0. At backend startup, GetTransactionSnapshot() sets CurrentSnapshotData.xmin=1000, xactCompletionCount=10 1. At backend startup, GetCatalogSnapshot() sets CatalogSnapshotData.xmin=1000, xactCompletionCount=10 2. On first query, GetTransactionSnapshot() reuses the snapshot with CurrentSnapshotData.xmin=1002, xactCompletionCount=10 3. GetCatalogSnapshot() is called again. It successfully reuses the snapshot with xmin=1000

In other words, the GetTransactionSnapshot() call initializes the CurrentSnapshotData with the snapshot with an earlier xmin. The first GetCatalogSnapshot() call happens almost immediately after that, so they're very likely - but not guaranteed - to get the same snapshot. When I removed the GetTransactionSnapshot() call, the gap between the first GetTransactionSnapshot() and GetCatalogSnapshot() calls in the backend became much longer, making this more likely to happen.

In any case, the missing "xactCompletionCount++" in ExpireAllKnownAssignedTransactionIds() and ExpireOldKnownAssignedTransactionIds() is a clear bug and explains this. I will prepare and commit patches to fix that.

Thanks for the testing!

--
Heikki Linnakangas
Neon (https://neon.tech)



Reply via email to