On Tue, Sep 23, 2014 at 10:55 AM, Robert Haas <robertmh...@gmail.com> wrote:
> But this gets at another point: the way we're benchmarking this right
> now, we're really conflating the effects of three different things:
>
> 1. Changing the locking regimen around the freelist and clocksweep.
> 2. Adding a bgreclaimer process.
> 3. Raising the number of buffer locking partitions.

I did some more experimentation on this.  Attached is a patch that
JUST does #1 and, as previously suggested, it uses a single spinlock
instead of two that would probably sit on the same cache line.
Without this patch, on a 32-client, read-only pgbench at scale factor
1000 and shared_buffers=8GB, perf -e cs -g -a has this to say about
LWLock-inspired preemption:

         - LWLockAcquire
            + 68.41% ReadBuffer_common
            + 31.59% StrategyGetBuffer

With the patch, unsurprisingly, StrategyGetBuffer disappears and the
only lwlocks that are causing context switches are the individual
buffer locks.  But are we suffering spinlock contention instead as a
result?  Yes, but not much.  s_lock is at 0.41% in the corresponding
profile, and only 6.83% of those calls are from the patched
StrategyGetBuffer.  In a similar profile of master, s_lock is at
0.31%.  So there's a bit of additional s_lock contention, but I think
it's considerably less than the contention over the lwlock it's
replacing, because the spinlock is only held for the minimal amount of
time needed, whereas the lwlock could be held across taking and
releasing many individual buffer locks.
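
To make that concrete, here is a minimal, self-contained sketch of the
pattern, using a pthread spinlock as a stand-in for our
SpinLockAcquire/SpinLockRelease; all names below are illustrative, not
taken from the patch.  Build with cc -O2 sketch.c -lpthread:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NBUFFERS 1024

static pthread_spinlock_t strategy_lock;    /* replaces BufFreelistLock */
static uint32_t next_victim;                /* shared clock-sweep hand */

/*
 * Advance the clock hand.  The spinlock protects only these few
 * instructions; examining the victim and taking any per-buffer locks
 * happens after it is released.
 */
static uint32_t
clock_sweep_next(void)
{
    uint32_t victim;

    pthread_spin_lock(&strategy_lock);
    victim = next_victim;
    if (++next_victim >= NBUFFERS)
        next_victim = 0;
    pthread_spin_unlock(&strategy_lock);

    return victim;
}

int
main(void)
{
    pthread_spin_init(&strategy_lock, PTHREAD_PROCESS_PRIVATE);
    for (int i = 0; i < 5; i++)
        printf("victim buffer: %u\n", clock_sweep_next());
    pthread_spin_destroy(&strategy_lock);
    return 0;
}

On master, by contrast, BufFreelistLock stayed held across all of that
per-buffer work, which is where the context-switch storm came from.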

TPS results are a little higher with the patch; these are alternating
5-minute runs:

master tps = 176010.647944 (including connections establishing)
master tps = 176615.291149 (including connections establishing)
master tps = 175648.370487 (including connections establishing)
reduce-replacement-locking tps = 177888.734320 (including connections establishing)
reduce-replacement-locking tps = 177797.842410 (including connections establishing)
reduce-replacement-locking tps = 177894.822656 (including connections establishing)

The picture is similar at 64 clients, but the benefit is a little larger:

master tps = 179037.231597 (including connections establishing)
master tps = 180500.937068 (including connections establishing)
master tps = 181565.706514 (including connections establishing)
reduce-replacement-locking tps = 185741.503425 (including connections establishing)
reduce-replacement-locking tps = 188598.728062 (including connections establishing)
reduce-replacement-locking tps = 187340.977277 (including connections establishing)

What's interesting is that I can't see in the perf output any real
sign that the buffer mapping locks are slowing things down, but they
clearly are, because when I take this patch and also boost
NUM_BUFFER_PARTITIONS to 128, the performance goes up:

reduce-replacement-locking-128 tps = 251001.812843 (including connections establishing)
reduce-replacement-locking-128 tps = 247368.925927 (including connections establishing)
reduce-replacement-locking-128 tps = 250775.304177 (including connections establishing)

The performance also goes up if I do that on master, but the effect is
substantially less:

master-128 tps = 219301.492902 (including connections establishing)
master-128 tps = 219786.249076 (including connections establishing)
master-128 tps = 219821.220271 (including connections establishing)
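
For anyone who wants to see the mechanics: the buffer mapping table is
partitioned by the hash of the buffer tag, so more partitions mean
fewer backends colliding on any one lock.  Here is a rough,
self-contained sketch of the mapping; the struct, hash, and names
below are illustrative stand-ins, and the real definitions live in
buf_internals.h, using hash_any() over the BufferTag and LWLocks
rather than this toy hash:

#include <stdint.h>
#include <stdio.h>

#define NUM_BUFFER_PARTITIONS 128       /* the value being benchmarked */

typedef struct
{
    uint32_t spcNode;                   /* tablespace */
    uint32_t dbNode;                    /* database */
    uint32_t relNode;                   /* relation */
    uint32_t blockNum;                  /* block within the relation */
} ToyBufferTag;

/* Stand-in hash; PostgreSQL really runs hash_any() over the tag. */
static uint32_t
toy_buftag_hash(const ToyBufferTag *tag)
{
    return tag->spcNode * 2654435761u ^ tag->dbNode * 40503u
        ^ tag->relNode * 2246822519u ^ tag->blockNum;
}

/* Each hashcode selects one partition lock, so two backends contend
 * only when their tags land in the same partition. */
static uint32_t
toy_buf_partition(uint32_t hashcode)
{
    return hashcode % NUM_BUFFER_PARTITIONS;
}

int
main(void)
{
    ToyBufferTag tag = {1663, 16384, 16385, 42};

    printf("tag -> partition %u of %d\n",
           toy_buf_partition(toy_buftag_hash(&tag)),
           NUM_BUFFER_PARTITIONS);
    return 0;
}

Doubling the partition count roughly halves the expected collision
rate under a uniform hash, which is consistent with contention that is
real but too diffuse to show up as an obvious hot spot in the profile.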

I think this shows pretty clearly that, even without the bgreclaimer,
there's a lot of merit in getting rid of BufFreelistLock and using a
spinlock held for the absolutely minimal number of instructions
instead.  There's already some benefit without doing anything about
the buffer mapping locks, and we'll get a lot more benefit once that
issue is addressed.  I think we need to do some serious study to see
whether bgreclaimer is even necessary, because it looks like this
change alone, which is much simpler, takes a huge bite out of the
scalability problem.

I welcome further testing and comments, but my current inclination is
to go ahead and push the attached patch.  To my knowledge, nobody has
at any point objected to this aspect of what's being proposed, and it
looks safe to me and seems to be a clear win.  We can then deal with
the other issues on their own merits.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
commit df4077cda2eae3eb4a5cf387da0c1e7616e73204
Author: Robert Haas <rh...@postgresql.org>
Date:   Mon Sep 22 16:42:14 2014 -0400

    Remove volatile qualifiers from lwlock.c.
    
    Now that spinlocks (hopefully!) act as compiler barriers, as of commit
    0709b7ee72e4bc71ad07b7120acd117265ab51d0, this should be safe.  This
    serves as a demonstration of the new coding style, and may be optimized
    better on some machines as well.
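
As a minimal, compilable illustration of the commit message's
reasoning (not part of the patch; COMPILER_BARRIER below is an assumed
stand-in for the barrier semantics that 0709b7ee gave the spinlock
primitives):

#include <stdio.h>

#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

static int shared_counter;      /* imagine: protected by a spinlock */

static int
bump_counter(void)
{
    int result;

    /* SpinLockAcquire(...) would sit here; since 0709b7ee it has the
     * effect of this compiler barrier at its end: */
    COMPILER_BARRIER();

    result = shared_counter++;  /* a plain, non-volatile access is now
                                 * safe: the barriers prevent the
                                 * compiler from caching or hoisting it
                                 * across the lock boundaries */

    /* ...and SpinLockRelease(...) provides the matching barrier: */
    COMPILER_BARRIER();

    return result;
}

int
main(void)
{
    printf("%d\n", bump_counter());
    return 0;
}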

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 7c96da5..66fb2e4 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -112,7 +112,7 @@ static lwlock_stats lwlock_stats_dummy;
 bool		Trace_lwlocks = false;
 
 inline static void
-PRINT_LWDEBUG(const char *where, const volatile LWLock *lock)
+PRINT_LWDEBUG(const char *where, const LWLock *lock)
 {
 	if (Trace_lwlocks)
 		elog(LOG, "%s(%s %d): excl %d shared %d head %p rOK %d",
@@ -406,9 +406,7 @@ LWLock *
 LWLockAssign(void)
 {
 	LWLock	   *result;
-
-	/* use volatile pointer to prevent code rearrangement */
-	volatile int *LWLockCounter;
+	int		   *LWLockCounter;
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int));
 	SpinLockAcquire(ShmemLock);
@@ -429,9 +427,7 @@ int
 LWLockNewTrancheId(void)
 {
 	int			result;
-
-	/* use volatile pointer to prevent code rearrangement */
-	volatile int *LWLockCounter;
+	int		   *LWLockCounter;
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int));
 	SpinLockAcquire(ShmemLock);
@@ -511,9 +507,8 @@ LWLockAcquireWithVar(LWLock *l, uint64 *valptr, uint64 val)
 
 /* internal function to implement LWLockAcquire and LWLockAcquireWithVar */
 static inline bool
-LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
+LWLockAcquireCommon(LWLock *lock, LWLockMode mode, uint64 *valptr, uint64 val)
 {
-	volatile LWLock *lock = l;
 	PGPROC	   *proc = MyProc;
 	bool		retry = false;
 	bool		result = true;
@@ -525,7 +520,7 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 	PRINT_LWDEBUG("LWLockAcquire", lock);
 
 #ifdef LWLOCK_STATS
-	lwstats = get_lwlock_stats_entry(l);
+	lwstats = get_lwlock_stats_entry(lock);
 
 	/* Count lock acquisition attempts */
 	if (mode == LW_EXCLUSIVE)
@@ -642,13 +637,13 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 		 * so that the lock manager or signal manager will see the received
 		 * signal when it next waits.
 		 */
-		LOG_LWDEBUG("LWLockAcquire", T_NAME(l), T_ID(l), "waiting");
+		LOG_LWDEBUG("LWLockAcquire", T_NAME(lock), T_ID(lock), "waiting");
 
 #ifdef LWLOCK_STATS
 		lwstats->block_count++;
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), T_ID(lock), mode);
 
 		for (;;)
 		{
@@ -659,9 +654,9 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 			extraWaits++;
 		}
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), T_ID(lock), mode);
 
-		LOG_LWDEBUG("LWLockAcquire", T_NAME(l), T_ID(l), "awakened");
+		LOG_LWDEBUG("LWLockAcquire", T_NAME(lock), T_ID(lock), "awakened");
 
 		/* Now loop back and try to acquire lock again. */
 		retry = true;
@@ -675,10 +670,10 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), mode);
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), T_ID(lock), mode);
 
 	/* Add lock to list of locks held by this backend */
-	held_lwlocks[num_held_lwlocks++] = l;
+	held_lwlocks[num_held_lwlocks++] = lock;
 
 	/*
 	 * Fix the process wait semaphore's count for any absorbed wakeups.
@@ -697,9 +692,8 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
  * If successful, cancel/die interrupts are held off until lock release.
  */
 bool
-LWLockConditionalAcquire(LWLock *l, LWLockMode mode)
+LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
 {
-	volatile LWLock *lock = l;
 	bool		mustwait;
 
 	PRINT_LWDEBUG("LWLockConditionalAcquire", lock);
@@ -747,14 +741,16 @@ LWLockConditionalAcquire(LWLock *l, LWLockMode mode)
 	{
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
-		LOG_LWDEBUG("LWLockConditionalAcquire", T_NAME(l), T_ID(l), "failed");
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(l), T_ID(l), mode);
+		LOG_LWDEBUG("LWLockConditionalAcquire",
+					T_NAME(lock), T_ID(lock), "failed");
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock),
+												 T_ID(lock), mode);
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
-		held_lwlocks[num_held_lwlocks++] = l;
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(l), T_ID(l), mode);
+		held_lwlocks[num_held_lwlocks++] = lock;
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), T_ID(lock), mode);
 	}
 
 	return !mustwait;
@@ -775,9 +771,8 @@ LWLockConditionalAcquire(LWLock *l, LWLockMode mode)
  * wake up, observe that their records have already been flushed, and return.
  */
 bool
-LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
+LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 {
-	volatile LWLock *lock = l;
 	PGPROC	   *proc = MyProc;
 	bool		mustwait;
 	int			extraWaits = 0;
@@ -788,7 +783,7 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
 	PRINT_LWDEBUG("LWLockAcquireOrWait", lock);
 
 #ifdef LWLOCK_STATS
-	lwstats = get_lwlock_stats_entry(l);
+	lwstats = get_lwlock_stats_entry(lock);
 #endif
 
 	/* Ensure we will have room to remember the lock */
@@ -855,13 +850,14 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
 		 * Wait until awakened.  Like in LWLockAcquire, be prepared for bogus
 		 * wakups, because we share the semaphore with ProcWaitForSignal.
 		 */
-		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "waiting");
+		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(lock), T_ID(lock),
+					"waiting");
 
 #ifdef LWLOCK_STATS
 		lwstats->block_count++;
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), T_ID(lock), mode);
 
 		for (;;)
 		{
@@ -872,9 +868,10 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
 			extraWaits++;
 		}
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), T_ID(lock), mode);
 
-		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "awakened");
+		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(lock), T_ID(lock),
+					"awakened");
 	}
 	else
 	{
@@ -892,14 +889,16 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
 	{
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
-		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "failed");
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(l), T_ID(l), mode);
+		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(lock), T_ID(lock), "failed");
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), T_ID(lock),
+													 mode);
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
-		held_lwlocks[num_held_lwlocks++] = l;
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(l), T_ID(l), mode);
+		held_lwlocks[num_held_lwlocks++] = lock;
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), T_ID(lock),
+												mode);
 	}
 
 	return !mustwait;
@@ -924,10 +923,8 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
  * in shared mode, returns 'true'.
  */
 bool
-LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
+LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 {
-	volatile LWLock *lock = l;
-	volatile uint64 *valp = valptr;
 	PGPROC	   *proc = MyProc;
 	int			extraWaits = 0;
 	bool		result = false;
@@ -938,7 +935,7 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 	PRINT_LWDEBUG("LWLockWaitForVar", lock);
 
 #ifdef LWLOCK_STATS
-	lwstats = get_lwlock_stats_entry(l);
+	lwstats = get_lwlock_stats_entry(lock);
 #endif   /* LWLOCK_STATS */
 
 	/*
@@ -981,7 +978,7 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 		}
 		else
 		{
-			value = *valp;
+			value = *valptr;
 			if (value != oldval)
 			{
 				result = false;
@@ -1023,13 +1020,14 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 		 * so that the lock manager or signal manager will see the received
 		 * signal when it next waits.
 		 */
-		LOG_LWDEBUG("LWLockWaitForVar", T_NAME(l), T_ID(l), "waiting");
+		LOG_LWDEBUG("LWLockWaitForVar", T_NAME(lock), T_ID(lock), "waiting");
 
 #ifdef LWLOCK_STATS
 		lwstats->block_count++;
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), T_ID(lock),
+										   LW_EXCLUSIVE);
 
 		for (;;)
 		{
@@ -1040,9 +1038,10 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 			extraWaits++;
 		}
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), T_ID(lock),
+										  LW_EXCLUSIVE);
 
-		LOG_LWDEBUG("LWLockWaitForVar", T_NAME(l), T_ID(l), "awakened");
+		LOG_LWDEBUG("LWLockWaitForVar", T_NAME(lock), T_ID(lock), "awakened");
 
 		/* Now loop back and check the status of the lock again. */
 	}
@@ -1050,7 +1049,7 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), T_ID(lock), LW_EXCLUSIVE);
 
 	/*
 	 * Fix the process wait semaphore's count for any absorbed wakeups.
@@ -1078,10 +1077,8 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
  * The caller must be holding the lock in exclusive mode.
  */
 void
-LWLockUpdateVar(LWLock *l, uint64 *valptr, uint64 val)
+LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
 {
-	volatile LWLock *lock = l;
-	volatile uint64 *valp = valptr;
 	PGPROC	   *head;
 	PGPROC	   *proc;
 	PGPROC	   *next;
@@ -1093,7 +1090,7 @@ LWLockUpdateVar(LWLock *l, uint64 *valptr, uint64 val)
 	Assert(lock->exclusive == 1);
 
 	/* Update the lock's value */
-	*valp = val;
+	*valptr = val;
 
 	/*
 	 * See if there are any LW_WAIT_UNTIL_FREE waiters that need to be woken
@@ -1139,9 +1136,8 @@ LWLockUpdateVar(LWLock *l, uint64 *valptr, uint64 val)
  * LWLockRelease - release a previously acquired lock
  */
 void
-LWLockRelease(LWLock *l)
+LWLockRelease(LWLock *lock)
 {
-	volatile LWLock *lock = l;
 	PGPROC	   *head;
 	PGPROC	   *proc;
 	int			i;
@@ -1154,11 +1150,11 @@ LWLockRelease(LWLock *l)
 	 */
 	for (i = num_held_lwlocks; --i >= 0;)
 	{
-		if (l == held_lwlocks[i])
+		if (lock == held_lwlocks[i])
 			break;
 	}
 	if (i < 0)
-		elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l));
+		elog(ERROR, "lock %s %d is not held", T_NAME(lock), T_ID(lock));
 	num_held_lwlocks--;
 	for (; i < num_held_lwlocks; i++)
 		held_lwlocks[i] = held_lwlocks[i + 1];
@@ -1238,14 +1234,15 @@ LWLockRelease(LWLock *l)
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
 
-	TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(l), T_ID(l));
+	TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock), T_ID(lock));
 
 	/*
 	 * Awaken any waiters I removed from the queue.
 	 */
 	while (head != NULL)
 	{
-		LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
+		LOG_LWDEBUG("LWLockRelease", T_NAME(lock), T_ID(lock),
+					"release waiter");
 		proc = head;
 		head = proc->lwWaitLink;
 		proc->lwWaitLink = NULL;