On Wed, Nov 18, 2020 at 5:25 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:

> On 11/08/2020 03:41, Andres Freund wrote:
> > On 2020-08-10 18:27:17 -0400, Robert Haas wrote:
> >> On Tue, Jun 2, 2020 at 8:25 AM Drouvot, Bertrand <bdrou...@amazon.com>
> wrote:
> >>> the patch adds into the LWLock struct:
> >>>
> >>>                      last_holding_pid: last pid owner of the lock
> >>>                      last_mode: last holding mode of the last pid
> owner of the lock
> >>>                      nholders: number of holders (could be >1 in case
> of LW_SHARED)
> >>
> >> There's been significant work done over the years to get the size of
> >> an LWLock down; I'm not very enthusiastic about making it bigger
> >> again. See for example commit 6150a1b08a9fe7ead2b25240be46dddeae9d98e1
> >> which embeds one of the LWLocks associated with a BufferDesc into the
> >> structure to reduce the number of cache lines associated with common
> >> buffer operations. I'm not sure whether this patch would increase the
> >> space usage of a BufferDesc to more than one cache line again, but at
> >> the very least it would make it a lot tighter, since it looks like it
> >> adds 12 bytes to the size of each one.
> >
> > +many. If anything I would like to make them *smaller*. We should strive
> > to make locking more and more granular, and that requires the space
> > overhead to be small.  I'm unhappy enough about the tranche being in
> > there, and requiring padding etc.
> >
> > I spent a *LOT* of sweat getting where we are, I'd be unhappy to regress
> > on size or efficiency.
>
> That seems to be the consensus, so I'm marking this as Returned with
> Feedback in the commitfest.
>

For what it's worth, I think that things like this are where we can really
benefit from external tracing and observation tools.

Instead of tracking the information persistently in the LWLock struct, we
can emit a tracepoint like TRACE_POSTGRESQL_LWLOCK_BLOCKED_ON(...) in a
context where the information is available to us, then forget all about it.
We pay next to nothing unless someone is actually collecting the info.

If someone wants to track LWLock blocking relationships during debugging
and performance work, they can use systemtap, dtrace, bpftrace, or a
similar tool to observe the LWLock tracepoints and generate stats on LWLock
blocking frequencies/durations. Doing so with systemtap should be rather
simple.

I had actually already looked at this. The tracepoints currently in the
LWLock code don't supply enough information in their arguments, so you have
to fall back on DWARF debuginfo-based probes, which is a pain: the tranche
name alone doesn't tell you which lock within a tranche is the current
target.

I've attached a patch that adds the actual LWLock* to each tracepoint in
the LWLock subsystem. That permits different locks to be tracked when
handling tracepoint events within a single process.
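
For example, with these arguments in place (tranche name as arg1, LWLock*
as arg3), something like the following rough, untested systemtap sketch
could aggregate LWLock wait counts and durations per tranche name and lock
pointer. The binary name and output format are just placeholders:

    /* Untested sketch: per-lock LWLock wait statistics. */
    global wait_start, waits

    probe process("postgres").mark("lwlock__wait__start") {
        wait_start[tid(), $arg3] = gettimeofday_us()
    }

    probe process("postgres").mark("lwlock__wait__done") {
        if ([tid(), $arg3] in wait_start) {
            waits[user_string($arg1), $arg3] <<<
                gettimeofday_us() - wait_start[tid(), $arg3]
            delete wait_start[tid(), $arg3]
        }
    }

    probe end {
        foreach ([tranche, lock] in waits)
            printf("%s %p: waits=%d avg=%d us\n", tranche, lock,
                   @count(waits[tranche, lock]), @avg(waits[tranche, lock]))
    }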

Another patch adds tracepoints that were missing from LWLockUpdateVar and
LWLockWaitForVar. A third removes a stray TRACE_POSTGRESQL_LWLOCK_ACQUIRE()
call in LWLockWaitForVar(), which should not have been there since the lock
is not actually acquired by LWLockWaitForVar().

I'd hoped to add some sort of "index within the tranche" to the
tracepoints, but it doesn't look feasible: there's no way to get a stable
identifier for an individual LWLock that is valid across different
backends. An LWLock inside a DSM segment may have a different address in
different backends depending on where the segment got mapped, and the
LWLock subsystem neither keeps track of its locks nor provides any way to
map an LWLock pointer to a cross-process-valid identifier. That's a bit of
a pain when tracing. To provide something stable I think it'd be necessary
to add a per-tranche counter, set by LWLockInitialize, to the LWLock struct
itself, which we certainly don't want to do. If this ever becomes important
we could probably check whether the address lies within a DSM segment or
static shmem and compute some kind of relative address to report. For now,
individual locks can only be reliably identified and compared across
backends for the individual LWLock tranches and extension-requested named
tranches.


By the way, the LWLock tracepoints currently evaluate T_NAME(lock), which
calls GetLWTrancheName(), on every tracepoint hit whenever Pg is built with
--enable-dtrace, even when nothing is actually tracing them. We might want
to guard them with systemtap/dtrace tracepoint semaphore tests so that they
collapse to a predicted-away branch when no tracer is attached. Doing so
requires a small change to how we compile probes.d and the related
makefile, but shouldn't be too hard. I haven't done that in this patch set.
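
To illustrate, here's a minimal sketch of what such a guard could look
like, using the *_ENABLED() macros that dtrace -h (and systemtap's dtrace
compatibility script) can emit when probe semaphores are generated; our
probes.h doesn't provide these yet, so the macro name below is an
assumption:

    /*
     * Sketch only: skip evaluating T_NAME(lock) / GetLWTrancheName()
     * unless a tracer has actually armed this probe's semaphore.
     */
    if (TRACE_POSTGRESQL_LWLOCK_ACQUIRE_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode, lock,
                                        lock->tranche);
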
From c8a0d97545a6d58d5ca1646146c63a687cb33079 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig.rin...@2ndquadrant.com>
Date: Thu, 19 Nov 2020 17:38:45 +0800
Subject: [PATCH v1 2/5] Pass the target LWLock* and tranche ID to LWLock
 tracepoints

Previously the TRACE_POSTGRESQL_LWLOCK_ tracepoints only received a
pointer to the LWLock tranche name. This made it impossible to identify
individual locks.

Passing the lock pointer itself isn't perfect. If the lock is allocated inside
a DSM segment then it might be mapped at a different address in different
backends. It's safe to compare lock pointers between backends (assuming
!EXEC_BACKEND) if they're in the individual lock tranches or an
extension-requested named tranche, but not necessarily for tranches in
BuiltinTrancheIds or tranches >= LWTRANCHE_FIRST_USER_DEFINED that were
directly assigned with LWLockNewTrancheId(). Still, it's better than nothing;
the pointer is stable within a backend, and usually between backends.
---
 src/backend/storage/lmgr/lwlock.c | 35 +++++++++++++++++++------------
 src/backend/utils/probes.d        | 18 +++++++++-------
 2 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 0a147bddaf..92de34a399 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1323,7 +1323,8 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 #endif
 
 		LWLockReportWaitStart(lock);
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode, lock,
+				lock->tranche);
 
 		for (;;)
 		{
@@ -1345,7 +1346,8 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		}
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode, lock,
+				lock->tranche);
 		LWLockReportWaitEnd();
 
 		LOG_LWDEBUG("LWLockAcquire", lock, "awakened");
@@ -1354,7 +1356,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		result = false;
 	}
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode, lock, lock->tranche);
 
 	/* Add lock to list of locks held by this backend */
 	held_lwlocks[num_held_lwlocks].lock = lock;
@@ -1405,14 +1407,16 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
 		RESUME_INTERRUPTS();
 
 		LOG_LWDEBUG("LWLockConditionalAcquire", lock, "failed");
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode, lock,
+				lock->tranche);
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
 		held_lwlocks[num_held_lwlocks].lock = lock;
 		held_lwlocks[num_held_lwlocks++].mode = mode;
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode, lock,
+				lock->tranche);
 	}
 	return !mustwait;
 }
@@ -1484,7 +1488,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 #endif
 
 			LWLockReportWaitStart(lock);
-			TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
+			TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode, lock,
+					lock->tranche);
 
 			for (;;)
 			{
@@ -1502,7 +1507,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 				Assert(nwaiters < MAX_BACKENDS);
 			}
 #endif
-			TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+			TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode, lock,
+					lock->tranche);
 			LWLockReportWaitEnd();
 
 			LOG_LWDEBUG("LWLockAcquireOrWait", lock, "awakened");
@@ -1532,7 +1538,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
 		LOG_LWDEBUG("LWLockAcquireOrWait", lock, "failed");
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode, lock,
+				lock->tranche);
 	}
 	else
 	{
@@ -1540,7 +1547,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 		/* Add lock to list of locks held by this backend */
 		held_lwlocks[num_held_lwlocks].lock = lock;
 		held_lwlocks[num_held_lwlocks++].mode = mode;
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode, lock,
+				lock->tranche);
 	}
 
 	return !mustwait;
@@ -1700,7 +1708,8 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 #endif
 
 		LWLockReportWaitStart(lock);
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), LW_EXCLUSIVE);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), LW_EXCLUSIVE, lock,
+				lock->tranche);
 
 		for (;;)
 		{
@@ -1719,7 +1728,7 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 		}
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), LW_EXCLUSIVE);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), LW_EXCLUSIVE, lock, lock->tranche);
 		LWLockReportWaitEnd();
 
 		LOG_LWDEBUG("LWLockWaitForVar", lock, "awakened");
@@ -1845,6 +1854,8 @@ LWLockRelease(LWLock *lock)
 	/* nobody else can have that kind of lock */
 	Assert(!(oldstate & LW_VAL_EXCLUSIVE));
 
+	/* Released, though not woken yet. All releases must fire this. */
+	TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock), mode, lock, lock->tranche);
 
 	/*
 	 * We're still waiting for backends to get scheduled, don't wake them up
@@ -1868,8 +1879,6 @@ LWLockRelease(LWLock *lock)
 		LWLockWakeup(lock);
 	}
 
-	TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
-
 	/*
 	 * Now okay to allow cancel/die interrupts.
 	 */
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index a0b0458108..89805c3a89 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -17,6 +17,7 @@
 #define LocalTransactionId unsigned int
 #define LWLockMode int
 #define LOCKMODE int
+#define LWLock void
 #define BlockNumber unsigned int
 #define Oid unsigned int
 #define ForkNumber int
@@ -28,14 +29,15 @@ provider postgresql {
 	probe transaction__commit(LocalTransactionId);
 	probe transaction__abort(LocalTransactionId);
 
-	probe lwlock__acquire(const char *, LWLockMode);
-	probe lwlock__release(const char *);
-	probe lwlock__wait__start(const char *, LWLockMode);
-	probe lwlock__wait__done(const char *, LWLockMode);
-	probe lwlock__condacquire(const char *, LWLockMode);
-	probe lwlock__condacquire__fail(const char *, LWLockMode);
-	probe lwlock__acquire__or__wait(const char *, LWLockMode);
-	probe lwlock__acquire__or__wait__fail(const char *, LWLockMode);
+	probe lwlock__acquire(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__release(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__wait__start(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__wait__done(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__condacquire(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__condacquire__fail(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__acquire__or__wait(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__acquire__or__wait__fail(const char *, LWLockMode, LWLock*, int);
+
 
 	probe lock__wait__start(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
 	probe lock__wait__done(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
-- 
2.26.2

From 9969eb12b5a1fdd39cdc7b3b7acb28370e1f95aa Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig.rin...@2ndquadrant.com>
Date: Thu, 19 Nov 2020 18:15:34 +0800
Subject: [PATCH v1 1/5] Remove bogus lwlock__acquire tracepoint from
 LWLockWaitForVar

Calls to LWLockWaitForVar fired the TRACE_POSTGRESQL_LWLOCK_ACQUIRE
tracepoint, but LWLockWaitForVar() never actually acquires the
LWLock. Remove it.
---
 src/backend/storage/lmgr/lwlock.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..0a147bddaf 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1727,8 +1727,6 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 		/* Now loop back and check the status of the lock again. */
 	}
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), LW_EXCLUSIVE);
-
 	/*
 	 * Fix the process wait semaphore's count for any absorbed wakeups.
 	 */
-- 
2.26.2

From 7c697ae06dff02bcd306c55aa6a79eabd58284fc Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig.rin...@2ndquadrant.com>
Date: Thu, 19 Nov 2020 18:05:39 +0800
Subject: [PATCH v1 3/5] Add to the tracepoints in LWLock routines

The existing tracepoints in lwlock.c didn't mark the start of LWLock
acquisition, so timing of the full LWLock acquire cycle wasn't
possible without relying on debuginfo. Since this can be quite
relevant for production performance issues, emit tracepoints at the
start of LWLock acquire.

Also add a tracepoint that's fired for all LWLock acquisitions at the
moment the shared memory state changes, whether done by LWLockAcquire
or LWLockConditionalAcquire. This lets tools reliably track which
backends hold which LWLocks even if we add new functions that acquire
LWLocks in future.

Add tracepoints in LWLockWaitForVar and LWLockUpdateVar so process
interaction around LWLock variable waits can be observed from trace
tooling. They can cause long waits and/or deadlocks, so it's worth
being able to time and track them.
---
 src/backend/storage/lmgr/lwlock.c | 24 ++++++++++++++++++++++++
 src/backend/utils/probes.d        |  8 ++++++++
 2 files changed, 32 insertions(+)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 92de34a399..d63119931b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -876,6 +876,9 @@ LWLockAttemptLock(LWLock *lock, LWLockMode mode)
 				if (mode == LW_EXCLUSIVE)
 					lock->owner = MyProc;
 #endif
+				/* All LWLock acquires must hit this tracepoint */
+				TRACE_POSTGRESQL_LWLOCK_ACQUIRED(T_NAME(lock), mode, lock,
+						lock->tranche);
 				return false;
 			}
 			else
@@ -1239,6 +1242,9 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
 		elog(ERROR, "too many LWLocks taken");
 
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE_START(T_NAME(lock), mode, lock,
+			lock->tranche);
+
 	/*
 	 * Lock out cancel/die interrupts until we exit the code section protected
 	 * by the LWLock.  This ensures that interrupts will not interfere with
@@ -1391,6 +1397,9 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
 	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
 		elog(ERROR, "too many LWLocks taken");
 
+	TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_START(T_NAME(lock), mode, lock,
+			lock->tranche);
+
 	/*
 	 * Lock out cancel/die interrupts until we exit the code section protected
 	 * by the LWLock.  This ensures that interrupts will not interfere with
@@ -1455,6 +1464,9 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
 		elog(ERROR, "too many LWLocks taken");
 
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_START(T_NAME(lock), mode, lock,
+			lock->tranche);
+
 	/*
 	 * Lock out cancel/die interrupts until we exit the code section protected
 	 * by the LWLock.  This ensures that interrupts will not interfere with
@@ -1637,6 +1649,9 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 
 	PRINT_LWDEBUG("LWLockWaitForVar", lock, LW_WAIT_UNTIL_FREE);
 
+	TRACE_POSTGRESQL_LWLOCK_WAITFORVAR_START(T_NAME(lock), lock,
+			lock->tranche, valptr, oldval, *valptr);
+
 	/*
 	 * Lock out cancel/die interrupts while we sleep on the lock.  There is no
 	 * cleanup mechanism to remove us from the wait queue if we got
@@ -1747,6 +1762,9 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 	 */
 	RESUME_INTERRUPTS();
 
+	TRACE_POSTGRESQL_LWLOCK_WAITFORVAR_DONE(T_NAME(lock), lock, lock->tranche,
+			valptr, oldval, *newval, result);
+
 	return result;
 }
 
@@ -1769,6 +1787,9 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
 
 	PRINT_LWDEBUG("LWLockUpdateVar", lock, LW_EXCLUSIVE);
 
+	TRACE_POSTGRESQL_LWLOCK_UPDATEVAR_START(T_NAME(lock), lock, lock->tranche, valptr,
+			val);
+
 	proclist_init(&wakeup);
 
 	LWLockWaitListLock(lock);
@@ -1809,6 +1830,9 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
 		waiter->lwWaiting = false;
 		PGSemaphoreUnlock(waiter->sem);
 	}
+
+	TRACE_POSTGRESQL_LWLOCK_UPDATEVAR_DONE(T_NAME(lock), lock, lock->tranche,
+			valptr, val);
 }
 
 
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 89805c3a89..a62fdf61df 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -29,14 +29,22 @@ provider postgresql {
 	probe transaction__commit(LocalTransactionId);
 	probe transaction__abort(LocalTransactionId);
 
+	probe lwlock__acquired(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__acquire__start(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__acquire(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__release(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__wait__start(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__wait__done(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__condacquire__start(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__condacquire(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__condacquire__fail(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__acquire__or__wait__start(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__acquire__or__wait(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__acquire__or__wait__fail(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__waitforvar__start(const char *, LWLock*, int, uint64, uint64, uint64);
+	probe lwlock__waitforvar__done(const char *, LWLock*, int, uint64, uint64, uint64, bool);
+	probe lwlock__updatevar__start(const char *, LWLock*, int, uint64, uint64);
+	probe lwlock__updatevar__done(const char *, LWLock*, int, uint64, uint64);
 
 
 	probe lock__wait__start(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
-- 
2.26.2
