On 19/03/2025 04:22, Tomas Vondra wrote:
I kept stress-testing this, and while the frequency massively increased
on PG18, I managed to reproduce this all the way back to PG14. I see
~100x more corefiles on PG18.

That is not a proof the issue was introduced in PG14, maybe it's just
the assert that was added there or something. Or maybe there's another
bug in PG18, making the impact worse.

But I'd suspect this is a bug in

commit 623a9ba79bbdd11c5eccb30b8bd5c446130e521c
Author: Andres Freund <and...@anarazel.de>
Date:   Mon Aug 17 21:07:10 2020 -0700

     snapshot scalability: cache snapshots using a xact completion counter.

     Previous commits made it faster/more scalable to compute snapshots. But not
     building a snapshot is still faster. Now that GetSnapshotData() does not
     maintain RecentGlobal* anymore, that is actually not too hard:

     ...

Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds() and ExpireOldKnownAssignedTransactionIds() update xactCompletionCount? This can happen during hot standby:

1. Backend acquires snapshot A with xmin 1000
2. Startup process calls ExpireOldKnownAssignedTransactionIds()
3. Backend acquires snapshot B with xmin 1050
4. Backend releases snapshot A, updating TransactionXmin to 1050
5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing snapshot A's data.

Because xactCompletionCount is not updated in step 2, the GetSnapshotDataReuse() call in step 5 reuses snapshot A's data. But snapshot A has xmin 1000, which is lower than the TransactionXmin of 1050 set in step 4.
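
To make the sequence concrete, here is a toy model of it in Python. This is only a sketch, not PostgreSQL code: the class, the per-slot cache, the release_oldest() helper and the bump_on_expire switch are all made up for illustration, and the model assumes that reuse is decided purely by comparing a stored counter value with the current one. It just shows that with an unchanged counter the stale xmin comes back in step 5, and that bumping the counter in step 2 would force a recompute.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    xmin: int
    completion_count: int  # counter value seen when the data was computed


class ToyStandby:
    """Toy model of one backend on a hot standby. Not PostgreSQL code."""

    def __init__(self, known_assigned, bump_on_expire):
        self.known_assigned = set(known_assigned)  # stand-in for KnownAssignedXids
        self.completion_count = 1                  # stand-in for xactCompletionCount
        self.bump_on_expire = bump_on_expire       # True models the suggested change
        self.slots = {}                            # hypothetical per-slot cached data
        self.transaction_xmin = None               # stand-in for TransactionXmin

    def expire_old(self, cutoff):
        """Stand-in for ExpireOldKnownAssignedTransactionIds(): drop xids <= cutoff."""
        self.known_assigned = {x for x in self.known_assigned if x > cutoff}
        if self.bump_on_expire:
            self.completion_count += 1

    def acquire(self, slot):
        """Stand-in for GetSnapshotData(), including the reuse shortcut."""
        cached = self.slots.get(slot)
        if cached is None or cached.completion_count != self.completion_count:
            # Counter changed (or nothing cached yet): compute fresh data.
            cached = Snapshot(min(self.known_assigned), self.completion_count)
            self.slots[slot] = cached
        if self.transaction_xmin is None:
            self.transaction_xmin = cached.xmin
        return cached

    def release_oldest(self, new_xmin):
        """Hypothetical helper: releasing the oldest snapshot advances the xmin."""
        self.transaction_xmin = new_xmin


def run_scenario(bump_on_expire):
    s = ToyStandby({1000, 1050}, bump_on_expire)
    s.acquire("A")                # step 1: snapshot A, xmin 1000
    s.expire_old(1049)            # step 2: startup process expires old xids
    b = s.acquire("B")            # step 3: snapshot B, xmin 1050
    s.release_oldest(b.xmin)      # step 4: TransactionXmin becomes 1050
    again = s.acquire("A")        # step 5: may reuse snapshot A's data
    return again.xmin, s.transaction_xmin
```

In this model run_scenario(False) returns (1000, 1050), i.e. the reused snapshot's xmin is behind TransactionXmin, while run_scenario(True) returns (1050, 1050).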

--
Heikki Linnakangas
Neon (https://neon.tech)


