Hi,

On 2020-03-01 00:36:01 -0800, Andres Freund wrote:
> Here are some numbers for the submitted patch series. I had to cull some
> further improvements to make it more manageable, but I think the numbers
> are still quite convincing.
>
> The workload is a read-only pgbench run, with pgbench -M prepared -c $conns
> -j $conns -S -n for each client count. This is on a machine with 2
> Intel(R) Xeon(R) Platinum 8168, but virtualized.
>
> conns   tps master         tps pgxact-split
>
> 1       26842.492845       26524.194821
> 10      246923.158682      249224.782661
> 50      695956.539704      709833.746374
> 100     1054727.043139     1903616.306028
> 200     964795.282957      1949200.338012
> 300     906029.377539      1927881.231478
> 400     845696.690912      1911065.369776
> 500     812295.222497      1926237.255856
> 600     888030.104213      1903047.236273
> 700     866896.532490      1886537.202142
> 800     863407.341506      1883768.592610
> 900     871386.608563      1874638.012128
> 1000    887668.277133      1876402.391502
> 1500    860051.361395      1815103.564241
> 2000    890900.098657      1775435.271018
> 3000    874184.980039      1653953.817997
> 4000    845023.080703      1582582.316043
> 5000    817100.195728      1512260.802371
>
> I think these are pretty nice results.
> One further cool consequence of the fact that GetSnapshotData()'s
> results can be made to depend only on the set of xids in progress is
> that caching the results of GetSnapshotData() becomes almost trivial at
> that point: we only need to recompute snapshots when a toplevel
> transaction commits/aborts.
>
> So we can avoid rebuilding snapshots when no commit has happened since
> the snapshot was last built. That amounts to assigning a current 'commit
> sequence number' to the snapshot, and checking it against the current
> number at the time of the next GetSnapshotData() call. Well, it turns
> out there's this "LSN" thing we assign to commits (there are some small
> issues with that though). I've experimented with that, and it
> considerably further improves the numbers above: a higher peak
> throughput, but more importantly it almost entirely removes the
> throughput regression from 2000 connections onwards.
>
> I'm still working on cleaning that part of the patch up, I'll post it
> in a bit.

I triggered a longer run on the same hardware, which also includes numbers
for the caching patch.

nclients   master            pgxact-split      pgxact-split-cache
1          29742.805074      29086.874404      28120.709885
2          58653.005921      56610.432919      57343.937924
3          116580.383993     115102.94057      117512.656103
4          150821.023662     154130.354635     152053.714824
5          186679.754357     189585.156519     191095.841847
6          219013.756252     223053.409306     224480.026711
7          256861.673892     256709.57311      262427.179555
8          291495.547691     294311.524297     296245.219028
9          332835.641015     333223.666809     335460.280487
10         367883.74842      373562.206447     375682.894433
15         561008.204553     578601.577916     587542.061911
20         748000.911053     794048.140682     810964.700467
25         904581.660543     1037279.089703    1043615.577083
30         999231.007768     1251113.123461    1288276.726489
35         1001274.289847    1438640.653822    1438508.432425
40         991672.445199     1518100.079695    1573310.171868
45         994427.395069     1575758.31948     1649264.339117
50         1017561.371878    1654776.716703    1715762.303282
60         993943.210188     1720318.989894    1789698.632656
70         971379.995255     1729836.303817    1819477.25356
80         966276.137538     1744019.347399    1842248.57152
90         901175.211649     1768907.069263    1847823.970726
100        803175.74326      1784636.397822    1865795.782943
125        664438.039582     1806275.514545    1870983.64688
150        623562.201749     1796229.009658    1876529.428419
175        680683.150597     1809321.487338    1910694.40987
200        668413.988251     1833457.942035    1878391.674828
225        682786.299485     1816577.462613    1884587.77743
250        727308.562076     1825796.324814    1864692.025853
275        676295.999761     1843098.107926    1908698.584573
300        698831.398432     1832068.168744    1892735.290045
400        661534.639489     1859641.983234    1898606.247281
500        645149.788352     1851124.475202    1888589.134422
600        740636.323211     1875152.669115    1880653.747185
700        858645.363292     1833527.505826    1874627.969414
800        858287.957814     1841914.668668    1892106.319085
900        882204.933544     1850998.221969    1868260.041595
1000       910988.551206     1836336.091652    1862945.18557
1500       917727.92827      1808822.338465    1864150.00307
2000       982137.053108     1813070.209217    1877104.342864
3000       1013514.639108    1753026.733843    1870416.924248
4000       1025476.80688     1600598.543635    1859908.314496
5000       1019889.160511    1534501.389169    1870132.571895
7500       968558.864242     1352137.828569    1853825.376742
10000      887558.112017     1198321.352461    1867384.381886
15000      687766.593628     950788.434914     1710509.977169

The odd dip for master between 90 and 700 connections looks like it's not
directly related to GetSnapshotData(). It appears to be related to the Linux
scheduler and virtualization.
When a pgbench thread and a postgres backend need to swap which of them gets
executed, and they are on different CPUs, the wakeup is more expensive when
the target CPU is idle or isn't going to reschedule soon. In the expensive
path an inter-processor interrupt (IPI) gets triggered, which requires
exiting the VM (which apparently is really expensive on Azure). I can trigger
similar behaviour for the other runs by renicing, albeit on a slightly
smaller scale.

I'll try to find a larger system that's not virtualized :/.

Greetings,

Andres Freund
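To make the caching scheme described above concrete before getting to the
patch itself, here is a toy, self-contained C sketch of the idea. All names
in it (shared_csn, ToySnapshot, get_snapshot, ...) are invented purely for
illustration; the actual implementation is the attached patch.

/*
 * csn_cache_sketch.c - not part of the patch, just an illustration of the
 * commit-sequence-number caching idea: a shared counter is bumped whenever
 * a toplevel transaction ends, and the snapshot is only rebuilt when the
 * counter has moved since the snapshot was last built.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct ToySnapshot
{
	uint64_t	csn;	/* commit seqno this snapshot was built at; 0 = never built */
	uint32_t	xcnt;	/* would hold the in-progress xids etc. */
} ToySnapshot;

static uint64_t shared_csn = 1;		/* bumped on every toplevel commit/abort */

static void
end_transaction(void)
{
	shared_csn++;		/* invalidates every cached snapshot */
}

/* Returns 1 if the snapshot had to be rebuilt, 0 if the cached one was reused. */
static int
get_snapshot(ToySnapshot *snap)
{
	/* fast path: nothing committed/aborted since snap was last built */
	if (snap->csn != 0 && snap->csn == shared_csn)
		return 0;

	/* slow path: rebuild by scanning the (imaginary) proc array */
	snap->xcnt = 0;
	snap->csn = shared_csn;
	return 1;
}

int
main(void)
{
	ToySnapshot snap = {0, 0};

	printf("rebuilt: %d\n", get_snapshot(&snap));	/* 1: first build */
	printf("rebuilt: %d\n", get_snapshot(&snap));	/* 0: csn unchanged */
	end_transaction();
	printf("rebuilt: %d\n", get_snapshot(&snap));	/* 1: csn moved */
	return 0;
}

The attached patch does the same thing inside GetSnapshotData():
ShmemVariableCache->csn is the shared counter, bumped in
ProcArrayEndTransactionInternal() and ProcArrayRemove(), SnapshotData.csn is
the per-snapshot copy, and copied/restored/imported snapshots get csn = 0 so
they are never treated as cacheable.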
diff --git i/src/include/access/transam.h w/src/include/access/transam.h
index e37ecb52fa1..2b46d2b76a0 100644
--- i/src/include/access/transam.h
+++ w/src/include/access/transam.h
@@ -183,6 +183,8 @@ typedef struct VariableCacheData
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
 
+	uint64		csn;
+
 	/*
 	 * These fields are protected by CLogTruncationLock
 	 */
diff --git i/src/include/utils/snapshot.h w/src/include/utils/snapshot.h
index 2bc415376ac..389f18cf8a5 100644
--- i/src/include/utils/snapshot.h
+++ w/src/include/utils/snapshot.h
@@ -207,6 +207,8 @@ typedef struct SnapshotData
 
 	TimestampTz whenTaken;		/* timestamp when snapshot was taken */
 	XLogRecPtr	lsn;			/* position in the WAL stream when taken */
+
+	uint64		csn;
 } SnapshotData;
 
 #endif							/* SNAPSHOT_H */
diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
index 617853f56fc..520a79c9f73 100644
--- i/src/backend/storage/ipc/procarray.c
+++ w/src/backend/storage/ipc/procarray.c
@@ -65,6 +65,7 @@
 #include "utils/snapmgr.h"
 
 #define UINT32_ACCESS_ONCE(var)		 ((uint32)(*((volatile uint32 *)&(var))))
+#define UINT64_ACCESS_ONCE(var)		 ((uint64)(*((volatile uint64 *)&(var))))
 
 /* Our shared memory area */
 typedef struct ProcArrayStruct
@@ -266,6 +267,7 @@ CreateSharedProcArray(void)
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+		ShmemVariableCache->csn = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
@@ -404,6 +406,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
 								  latestXid))
 			ShmemVariableCache->latestCompletedXid = latestXid;
+		/* Same with CSN */
+		ShmemVariableCache->csn++;
 
 		ProcGlobal->xids[proc->pgxactoff] = 0;
 		ProcGlobal->nsubxids[proc->pgxactoff] = 0;
@@ -531,6 +535,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
+	Assert(LWLockHeldByMe(ProcArrayLock));
 	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
 	Assert(ProcGlobal->xids[pgxactoff] == proc->xidCopy);
 
@@ -561,6 +566,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
 							  latestXid))
 		ShmemVariableCache->latestCompletedXid = latestXid;
+
+	/* Same with CSN */
+	ShmemVariableCache->csn++;
 }
 
 /*
@@ -1637,7 +1645,7 @@ GetSnapshotData(Snapshot snapshot)
 	TransactionId oldestxid;
 	int			mypgxactoff;
 	TransactionId myxid;
-
+	uint64		csn;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1675,6 +1683,26 @@ GetSnapshotData(Snapshot snapshot)
 					 errmsg("out of memory")));
 	}
 
+#if 1
+	if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
+		UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)
+	{
+		if (!TransactionIdIsValid(MyProc->xmin))
+			MyProc->xmin = TransactionXmin = snapshot->xmin;
+		RecentXmin = snapshot->xmin;
+		Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
+
+		snapshot->curcid = GetCurrentCommandId(false);
+		snapshot->active_count = 0;
+		snapshot->regd_count = 0;
+		snapshot->copied = false;
+		snapshot->lsn = InvalidXLogRecPtr;
+		snapshot->whenTaken = 0;
+
+		return snapshot;
+	}
+#endif
+
 	/*
 	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
 	 * going to set MyProc->xmin.
@@ -1687,6 +1715,7 @@ GetSnapshotData(Snapshot snapshot)
 
 		nextfxid = ShmemVariableCache->nextFullXid;
 		oldestxid = ShmemVariableCache->oldestXid;
+		csn = ShmemVariableCache->csn;
 
 		/* xmax is always latestCompletedXid + 1 */
 		xmax = ShmemVariableCache->latestCompletedXid;
@@ -1941,6 +1970,8 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
 
+	snapshot->csn = csn;
+
 	if (old_snapshot_threshold < 0)
 	{
 		/*
diff --git i/src/backend/utils/time/snapmgr.c w/src/backend/utils/time/snapmgr.c
index 41914f1a6c7..9d8ebe51307 100644
--- i/src/backend/utils/time/snapmgr.c
+++ w/src/backend/utils/time/snapmgr.c
@@ -604,6 +604,8 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	CurrentSnapshot->takenDuringRecovery = sourcesnap->takenDuringRecovery;
 	/* NB: curcid should NOT be copied, it's a local matter */
 
+	CurrentSnapshot->csn = 0;
+
 	/*
 	 * Now we have to fix what GetSnapshotData did with MyPgXact->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
@@ -679,6 +681,7 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
 	newsnap->copied = true;
+	newsnap->csn = 0;
 
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
@@ -2180,6 +2183,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->csn = 0;
 
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)