Hi,

On 2020-03-01 00:36:01 -0800, Andres Freund wrote:
> Here are some numbers for the submitted patch series. I had to cull some
> further improvements to make it more manageable, but I think the numbers
> are still quite convincing.
>
> The workload is a read-only pgbench run, with pgbench -M prepared -c $conns
> -j $conns -S -n for each client count. This is on a machine with 2
> Intel(R) Xeon(R) Platinum 8168, but virtualized.
>
> conns   tps master         tps pgxact-split
>
> 1       26842.492845       26524.194821
> 10      246923.158682      249224.782661
> 50      695956.539704      709833.746374
> 100     1054727.043139     1903616.306028
> 200     964795.282957      1949200.338012
> 300     906029.377539      1927881.231478
> 400     845696.690912      1911065.369776
> 500     812295.222497      1926237.255856
> 600     888030.104213      1903047.236273
> 700     866896.532490      1886537.202142
> 800     863407.341506      1883768.592610
> 900     871386.608563      1874638.012128
> 1000    887668.277133      1876402.391502
> 1500    860051.361395      1815103.564241
> 2000    890900.098657      1775435.271018
> 3000    874184.980039      1653953.817997
> 4000    845023.080703      1582582.316043
> 5000    817100.195728      1512260.802371
>
> I think these are pretty nice results.
> One further cool consequence of the fact that GetSnapshotData()'s
> results can be made to depend only on the set of xids in progress is
> that caching the results of GetSnapshotData() becomes almost trivial at
> that point: we only need to recompute snapshots when a toplevel
> transaction commits/aborts.
>
> So we can avoid rebuilding snapshots when no commit has happened since
> the snapshot was last built. That amounts to assigning a current 'commit
> sequence number' to the snapshot, and checking it against the current
> number at the time of the next GetSnapshotData() call. Well, it turns
> out there's this "LSN" thing we assign to commits (there are some small
> issues with that though). I've experimented with that, and it
> considerably further improves the numbers above: a higher peak
> throughput, but more importantly it almost entirely removes the
> throughput regression from 2000 connections onwards.
>
> I'm still working on cleaning that part of the patch up, I'll post it
> in a bit.

I triggered a longer run on the same hardware, which also includes numbers
for the caching patch.

nclients   master            pgxact-split      pgxact-split-cache
1          29742.805074      29086.874404      28120.709885
2          58653.005921      56610.432919      57343.937924
3          116580.383993     115102.94057      117512.656103
4          150821.023662     154130.354635     152053.714824
5          186679.754357     189585.156519     191095.841847
6          219013.756252     223053.409306     224480.026711
7          256861.673892     256709.57311      262427.179555
8          291495.547691     294311.524297     296245.219028
9          332835.641015     333223.666809     335460.280487
10         367883.74842      373562.206447     375682.894433
15         561008.204553     578601.577916     587542.061911
20         748000.911053     794048.140682     810964.700467
25         904581.660543     1037279.089703    1043615.577083
30         999231.007768     1251113.123461    1288276.726489
35         1001274.289847    1438640.653822    1438508.432425
40         991672.445199     1518100.079695    1573310.171868
45         994427.395069     1575758.31948     1649264.339117
50         1017561.371878    1654776.716703    1715762.303282
60         993943.210188     1720318.989894    1789698.632656
70         971379.995255     1729836.303817    1819477.25356
80         966276.137538     1744019.347399    1842248.57152
90         901175.211649     1768907.069263    1847823.970726
100        803175.74326      1784636.397822    1865795.782943
125        664438.039582     1806275.514545    1870983.64688
150        623562.201749     1796229.009658    1876529.428419
175        680683.150597     1809321.487338    1910694.40987
200        668413.988251     1833457.942035    1878391.674828
225        682786.299485     1816577.462613    1884587.77743
250        727308.562076     1825796.324814    1864692.025853
275        676295.999761     1843098.107926    1908698.584573
300        698831.398432     1832068.168744    1892735.290045
400        661534.639489     1859641.983234    1898606.247281
500        645149.788352     1851124.475202    1888589.134422
600        740636.323211     1875152.669115    1880653.747185
700        858645.363292     1833527.505826    1874627.969414
800        858287.957814     1841914.668668    1892106.319085
900        882204.933544     1850998.221969    1868260.041595
1000       910988.551206     1836336.091652    1862945.18557
1500       917727.92827      1808822.338465    1864150.00307
2000       982137.053108     1813070.209217    1877104.342864
3000       1013514.639108    1753026.733843    1870416.924248
4000       1025476.80688     1600598.543635    1859908.314496
5000       1019889.160511    1534501.389169    1870132.571895
7500       968558.864242     1352137.828569    1853825.376742
10000      887558.112017     1198321.352461    1867384.381886
15000      687766.593628     950788.434914     1710509.977169

The odd dip for master between 90 and 700 connections looks like it's not
directly related to GetSnapshotData(). It appears to be related to the Linux
scheduler and virtualization.
When a pgbench thread and a postgres backend need to swap which of them gets
executed, and they are on different CPUs, the wakeup is more expensive when
the target CPU is idle or isn't going to reschedule soon. In the expensive
path an inter-processor interrupt (IPI) gets triggered, which requires
exiting the VM (which apparently is really expensive on Azure). I can trigger
similar behaviour for the other runs by renicing, albeit on a slightly
smaller scale.

I'll try to find a larger system that's not virtualized :/.

Greetings,

Andres Freund
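To make the caching scheme described above concrete before getting to the
patch itself, here is a toy, self-contained C sketch of the idea. All names
in it (shared_csn, ToySnapshot, get_snapshot, ...) are invented purely for
illustration; the actual implementation is the attached patch.

/*
 * csn_cache_sketch.c - not part of the patch, just an illustration of the
 * commit-sequence-number caching idea: a shared counter is bumped whenever
 * a toplevel transaction ends, and the snapshot is only rebuilt when the
 * counter has moved since the snapshot was last built.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct ToySnapshot
{
	uint64_t	csn;	/* commit seqno this snapshot was built at; 0 = never built */
	uint32_t	xcnt;	/* would hold the in-progress xids etc. */
} ToySnapshot;

static uint64_t shared_csn = 1;		/* bumped on every toplevel commit/abort */

static void
end_transaction(void)
{
	shared_csn++;		/* invalidates every cached snapshot */
}

/* Returns 1 if the snapshot had to be rebuilt, 0 if the cached one was reused. */
static int
get_snapshot(ToySnapshot *snap)
{
	/* fast path: nothing committed/aborted since snap was last built */
	if (snap->csn != 0 && snap->csn == shared_csn)
		return 0;

	/* slow path: rebuild by scanning the (imaginary) proc array */
	snap->xcnt = 0;
	snap->csn = shared_csn;
	return 1;
}

int
main(void)
{
	ToySnapshot snap = {0, 0};

	printf("rebuilt: %d\n", get_snapshot(&snap));	/* 1: first build */
	printf("rebuilt: %d\n", get_snapshot(&snap));	/* 0: csn unchanged */
	end_transaction();
	printf("rebuilt: %d\n", get_snapshot(&snap));	/* 1: csn moved */
	return 0;
}

The attached patch does the same thing inside GetSnapshotData():
ShmemVariableCache->csn is the shared counter, bumped in
ProcArrayEndTransactionInternal() and ProcArrayRemove(), SnapshotData.csn is
the per-snapshot copy, and copied/restored/imported snapshots get csn = 0 so
they are never treated as cacheable.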
diff --git i/src/include/access/transam.h w/src/include/access/transam.h
index e37ecb52fa1..2b46d2b76a0 100644
--- i/src/include/access/transam.h
+++ w/src/include/access/transam.h
@@ -183,6 +183,8 @@ typedef struct VariableCacheData
 	TransactionId latestCompletedXid;	/* newest XID that has committed or
 										 * aborted */
 
+	uint64		csn;
+
 	/*
 	 * These fields are protected by CLogTruncationLock
 	 */
diff --git i/src/include/utils/snapshot.h w/src/include/utils/snapshot.h
index 2bc415376ac..389f18cf8a5 100644
--- i/src/include/utils/snapshot.h
+++ w/src/include/utils/snapshot.h
@@ -207,6 +207,8 @@ typedef struct SnapshotData
 
 	TimestampTz whenTaken;		/* timestamp when snapshot was taken */
 	XLogRecPtr	lsn;			/* position in the WAL stream when taken */
+
+	uint64		csn;
 } SnapshotData;
 
 #endif							/* SNAPSHOT_H */
diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
index 617853f56fc..520a79c9f73 100644
--- i/src/backend/storage/ipc/procarray.c
+++ w/src/backend/storage/ipc/procarray.c
@@ -65,6 +65,7 @@
 #include "utils/snapmgr.h"
 
 #define UINT32_ACCESS_ONCE(var)		 ((uint32)(*((volatile uint32 *)&(var))))
+#define UINT64_ACCESS_ONCE(var)		 ((uint64)(*((volatile uint64 *)&(var))))
 
 /* Our shared memory area */
 typedef struct ProcArrayStruct
@@ -266,6 +267,7 @@ CreateSharedProcArray(void)
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+		ShmemVariableCache->csn = 1;
 	}
 
 	allProcs = ProcGlobal->allProcs;
@@ -404,6 +406,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
 								  latestXid))
 			ShmemVariableCache->latestCompletedXid = latestXid;
+		/* Same with CSN */
+		ShmemVariableCache->csn++;
 
 		ProcGlobal->xids[proc->pgxactoff] = 0;
 		ProcGlobal->nsubxids[proc->pgxactoff] = 0;
@@ -531,6 +535,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 {
 	size_t		pgxactoff = proc->pgxactoff;
 
+	Assert(LWLockHeldByMe(ProcArrayLock));
 	Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
 	Assert(ProcGlobal->xids[pgxactoff] == proc->xidCopy);
 
@@ -561,6 +566,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
 	if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
 							  latestXid))
 		ShmemVariableCache->latestCompletedXid = latestXid;
+
+	/* Same with CSN */
+	ShmemVariableCache->csn++;
 }
 
 /*
@@ -1637,7 +1645,7 @@ GetSnapshotData(Snapshot snapshot)
 	TransactionId oldestxid;
 	int			mypgxactoff;
 	TransactionId myxid;
-
+	uint64		csn;
 	TransactionId replication_slot_xmin = InvalidTransactionId;
 	TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
 
@@ -1675,6 +1683,26 @@ GetSnapshotData(Snapshot snapshot)
 					 errmsg("out of memory")));
 	}
 
+#if 1
+	if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
+		UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)
+	{
+		if (!TransactionIdIsValid(MyProc->xmin))
+			MyProc->xmin = TransactionXmin = snapshot->xmin;
+		RecentXmin = snapshot->xmin;
+		Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
+
+		snapshot->curcid = GetCurrentCommandId(false);
+		snapshot->active_count = 0;
+		snapshot->regd_count = 0;
+		snapshot->copied = false;
+		snapshot->lsn = InvalidXLogRecPtr;
+		snapshot->whenTaken = 0;
+
+		return snapshot;
+	}
+#endif
+
 	/*
 	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
 	 * going to set MyProc->xmin.
@@ -1687,6 +1715,7 @@ GetSnapshotData(Snapshot snapshot)
 
 		nextfxid = ShmemVariableCache->nextFullXid;
 		oldestxid = ShmemVariableCache->oldestXid;
+		csn = ShmemVariableCache->csn;
 
 		/* xmax is always latestCompletedXid + 1 */
 		xmax = ShmemVariableCache->latestCompletedXid;
@@ -1941,6 +1970,8 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
 
+	snapshot->csn = csn;
+
 	if (old_snapshot_threshold < 0)
 	{
 		/*
diff --git i/src/backend/utils/time/snapmgr.c w/src/backend/utils/time/snapmgr.c
index 41914f1a6c7..9d8ebe51307 100644
--- i/src/backend/utils/time/snapmgr.c
+++ w/src/backend/utils/time/snapmgr.c
@@ -604,6 +604,8 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 	CurrentSnapshot->takenDuringRecovery = sourcesnap->takenDuringRecovery;
 	/* NB: curcid should NOT be copied, it's a local matter */
 
+	CurrentSnapshot->csn = 0;
+
 	/*
 	 * Now we have to fix what GetSnapshotData did with MyPgXact->xmin and
 	 * TransactionXmin.  There is a race condition: to make sure we are not
@@ -679,6 +681,7 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->regd_count = 0;
 	newsnap->active_count = 0;
 	newsnap->copied = true;
+	newsnap->csn = 0;
 
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
@@ -2180,6 +2183,7 @@ RestoreSnapshot(char *start_address)
 	snapshot->curcid = serialized_snapshot.curcid;
 	snapshot->whenTaken = serialized_snapshot.whenTaken;
 	snapshot->lsn = serialized_snapshot.lsn;
+	snapshot->csn = 0;
 
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)