Re: CSN snapshots in hot standby

Heikki Linnakangas Fri, 15 Nov 2024 11:16:50 -0800

On 29/10/2024 18:33, Heikki Linnakangas wrote:

I added two tests to the test suite:
                                 master     patched
insert-all-different-xids:     0.00027    0.00019 s / iteration
insert-all-different-subxids:  0.00023    0.00020 s / iteration
insert-all-different-xids: Open 1000 connections, insert one row in each, and leave the transactions open. In the replica, select all the rows.
insert-all-different-subxids: The same, but with 1 transaction with 1000 subxids.
The point of these new tests is to test the scenario where the cache doesn't help and just adds overhead, because each XID is looked up only once. Seems to be fine. Surprisingly good actually; I'll do some more profiling on that to understand why it's even faster than 'master'.


Ok, I did some profiling and it makes sense:

In the insert-all-different-xids test on 'master', we spend about 60% of CPU time in XidInMVCCSnapshot(), doing pg_lfind32() over the subxip array. We should probably sort the array and use a binary search if it's large or something...
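
As a rough illustration of the sort-and-binary-search idea (this is not code from the patch or from PostgreSQL; the function names and the threshold of 64 are made up for the example), a standalone sketch could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cutoff: below this, a linear scan is assumed to be cheap enough. */
#define XID_BSEARCH_THRESHOLD 64

/* Plain numeric ordering; enough for membership tests, no wraparound logic. */
static int
xid_cmp(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *) a;
    uint32_t y = *(const uint32_t *) b;

    return (x > y) - (x < y);
}

/* Call once when the array is built: sort only if it is large. */
static void
xid_array_prepare(uint32_t *xids, size_t n)
{
    assert(xids != NULL || n == 0);
    if (n >= XID_BSEARCH_THRESHOLD)
        qsort(xids, n, sizeof(uint32_t), xid_cmp);
}

/* Membership test: binary search on large (sorted) arrays, linear scan otherwise. */
static bool
xid_array_contains(const uint32_t *xids, size_t n, uint32_t xid)
{
    assert(xids != NULL || n == 0);
    if (n >= XID_BSEARCH_THRESHOLD)
        return bsearch(&xid, xids, n, sizeof(uint32_t), xid_cmp) != NULL;

    for (size_t i = 0; i < n; i++)
    {
        if (xids[i] == xid)
            return true;
    }
    return false;
}
```

The sort is paid once per array, so this only helps when the same array is probed many times, which is the case in the test above where each of the 1000 rows triggers a lookup.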

With these patches, instead of the pg_lfind32() over the subxip array, we perform one CSN SLRU lookup, and the page is cached. There's locking overhead etc. with that, but it's still cheaper than the pg_lfind32().

In the insert-all-different-subxids test on 'master', the subxip array is overflowed, so we call SubTransGetTopmostTransaction() on each XID. That performs two pg_subtrans lookups for each XID, first for the subxid, then for the parent. With these patches, we perform just one SLRU lookup, in pg_csnlog, which is faster.

Now the downside of this new cache: Since it has no size limit, if you keep looking up different XIDs, it will keep growing until it holds all the XIDs between the snapshot's xmin and xmax. That can take a lot of memory in the worst case. Radix tree is pretty memory efficient, but holding, say 1 billion XIDs would probably take something like 500 MB of RAM (the radix tree stores 64-bit words with 2 bits per XID, plus the radix tree nodes). That's per snapshot, so if you have a lot of connections, maybe even with multiple snapshots each, that can add up.

I'm inclined to accept that memory usage. If we wanted to limit the size of the cache, we would need to choose a policy on how to truncate it (delete random nodes?), what the limit should be etc. But I think it'd be rare to hit those cases in practice. If you have a one billion XID old transaction running in the primary, you probably have bigger problems already.
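
To make the memory estimate above easier to check, here is a small standalone sketch of packing 2 bits per XID into 64-bit words. It is only an illustration: the helper names are invented, the meaning of the four 2-bit values is left open, and the patch's actual layout may differ.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_XID   2
#define XIDS_PER_WORD  (64 / BITS_PER_XID)   /* 32 XIDs fit in one word */

/* Store a 2-bit status (0..3) for 'xid' in the word that covers it. */
static inline void
word_set(uint64_t *word, uint32_t xid, unsigned status)
{
    unsigned shift = (xid % XIDS_PER_WORD) * BITS_PER_XID;

    assert(status <= 3);
    *word = (*word & ~((uint64_t) 3 << shift)) | ((uint64_t) status << shift);
}

/* Read back the 2-bit status for 'xid'. */
static inline unsigned
word_get(uint64_t word, uint32_t xid)
{
    unsigned shift = (xid % XIDS_PER_WORD) * BITS_PER_XID;

    return (unsigned) ((word >> shift) & 3);
}

/* Bytes of packed words needed for nxids XIDs, excluding tree nodes. */
static inline uint64_t
payload_bytes(uint64_t nxids)
{
    uint64_t nwords = (nxids + XIDS_PER_WORD - 1) / XIDS_PER_WORD;

    return nwords * sizeof(uint64_t);
}
```

With this layout, payload_bytes(1000000000) comes to 250,000,000 bytes, so the packed words alone are about 250 MB for 1 billion XIDs. The remainder of the roughly 500 MB figure would then be radix tree nodes and allocation overhead, which this sketch does not model.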

I'd love to hear some thoughts on this caching behavior. Is it acceptable to let the cache grow, potentially to very large sizes in the worst cases? Or do we need to make it more complicated and implement some eviction policy?


--
Heikki Linnakangas
Neon (https://neon.tech)
