On Wed, Nov 18, 2020 at 5:25 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:
> On 11/08/2020 03:41, Andres Freund wrote:
> > On 2020-08-10 18:27:17 -0400, Robert Haas wrote:
> >> On Tue, Jun 2, 2020 at 8:25 AM Drouvot, Bertrand <bdrou...@amazon.com> wrote:
> >>> the patch adds into the LWLock struct:
> >>>
> >>>     last_holding_pid: last pid owner of the lock
> >>>     last_mode: last holding mode of the last pid owner of the lock
> >>>     nholders: number of holders (could be >1 in case of LW_SHARED)
> >>
> >> There's been significant work done over the years to get the size of
> >> an LWLock down; I'm not very enthusiastic about making it bigger
> >> again. See for example commit 6150a1b08a9fe7ead2b25240be46dddeae9d98e1
> >> which embeds one of the LWLocks associated with a BufferDesc into the
> >> structure to reduce the number of cache lines associated with common
> >> buffer operations. I'm not sure whether this patch would increase the
> >> space usage of a BufferDesc to more than one cache line again, but at
> >> the very least it would make it a lot tighter, since it looks like it
> >> adds 12 bytes to the size of each one.
> >
> > +many. If anything I would like to make them *smaller*. We should strive
> > to make locking more and more granular, and that requires the space
> > overhead to be small. I'm unhappy enough about the tranche being in
> > there, and requiring padding etc.
> >
> > I spent a *LOT* of sweat getting where we are, I'd be unhappy to regress
> > on size or efficiency.
>
> That seems to be the consensus, so I'm marking this as Returned with
> Feedback in the commitfest.

For what it's worth, I think that things like this are where we can really
benefit from external tracing and observation tools. Instead of tracking the
information persistently in the LWLock struct, we can emit
TRACE_POSTGRESQL_LWLOCK_BLOCKED_ON(...) in a context where we have the
information available to us, then forget all about it. We don't spend
anything unless someone's collecting the info.
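To make that concrete: the TRACE_POSTGRESQL_LWLOCK_BLOCKED_ON name above is hypothetical, not part of the attached patches, but such a probe could be declared in src/backend/utils/probes.d along these lines:

```c
	/*
	 * Hypothetical probe, fired when a backend finds it must block on an
	 * LWLock.  It would carry enough context (tranche name, requested mode,
	 * lock pointer, tranche id) for an external tool to reconstruct blocking
	 * relationships, instead of storing them in the LWLock struct.
	 */
	probe lwlock__blocked__on(const char *, LWLockMode, LWLock*, int);
```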
If someone wants to track LWLock blocking relationships during debugging and performance work, they can use systemtap, dtrace, bpftrace, or a similar tool to observe the LWLock tracepoints and generate stats on LWLock blocking frequencies and durations. Doing so with systemtap should be rather simple.

I actually had a look at this already. I found that the tracepoints currently in the LWLock code don't supply enough information in their arguments, so you have to fall back on DWARF-debuginfo-based probes, which is a pain. In particular, the tranche name alone doesn't let you identify which lock within a tranche is the current target.

I've attached a patch that adds the actual LWLock* to each tracepoint in the LWLock subsystem. That permits different locks to be distinguished when handling tracepoint events within a single process. Another patch adds tracepoints that were missing from LWLockUpdateVar and LWLockWaitForVar, and a third removes a stray TRACE_POSTGRESQL_LWLOCK_ACQUIRE() in LWLockWaitForVar(), which should not have been there since LWLockWaitForVar() does not actually acquire the lock.

I'd hoped to add some sort of "index within the tranche" to the tracepoints, but it turns out not to be feasible: there's basically no way to get a stable identifier for an individual LWLock that is valid across different backends. An LWLock inside a DSM segment may have a different address in each backend, depending on where the segment got mapped. The LWLock subsystem doesn't keep track of its locks and has no way to map an LWLock pointer to any sort of cross-process-valid identifier. So that's a bit of a pain when tracing. To provide something stable, we'd have to store some kind of per-tranche counter, set by LWLockInitialize, in the LWLock struct itself, which we sure don't want to do.
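As a sketch of what consuming these tracepoints could look like (untested; it assumes the lock-pointer and tranche-id arguments added by the attached patches, and that "postgres" resolves to the server binary on your PATH), a systemtap script aggregating wait times per individual lock might be:

```systemtap
global wait_start, waits

/* Args per the patched probes: $arg1 = tranche name, $arg2 = mode,
 * $arg3 = LWLock*, $arg4 = tranche id. */
probe process("postgres").mark("lwlock__wait__start")
{
	wait_start[tid()] = gettimeofday_us()
}

probe process("postgres").mark("lwlock__wait__done")
{
	if (tid() in wait_start) {
		/* Key stats on (tranche name, lock pointer) to tell locks apart. */
		waits[user_string($arg1), $arg3] <<< gettimeofday_us() - wait_start[tid()]
		delete wait_start[tid()]
	}
}

probe end
{
	foreach ([name, lock] in waits)
		printf("%s %x: n=%d avg=%dus\n", name, lock,
		       @count(waits[name, lock]), @avg(waits[name, lock]))
}
```

Keying the aggregate on the lock pointer is exactly what the current tranche-name-only arguments don't allow, subject to the cross-backend address caveats above.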
If this ever becomes important for some reason, we could probably look up whether the address falls within a DSM segment or static shmem and compute some kind of relative address to report. For now, individual locks within a tranche can't be identified and compared, except for locks in the individual-LWLock tranches and in named tranches.

By the way, the LWLock tracepoints currently evaluate T_NAME(lock), which calls GetLWTrancheName(), on every tracepoint hit whenever Pg is built with --enable-dtrace, even when nothing is actually tracing them. We might want to guard them with systemtap/dtrace tracepoint semaphore tests so they reduce to a predicted-away branch when no tracer is attached. Doing so requires a small change to how we compile probes.d and the related makefile, but shouldn't be too hard. I haven't done that in this patch set.
From c8a0d97545a6d58d5ca1646146c63a687cb33079 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig.rin...@2ndquadrant.com>
Date: Thu, 19 Nov 2020 17:38:45 +0800
Subject: [PATCH v1 2/5] Pass the target LWLock* and tranche ID to LWLock
 tracepoints

Previously the TRACE_POSTGRESQL_LWLOCK_ tracepoints only received a
pointer to the LWLock tranche name. This made it impossible to identify
individual locks.

Passing the lock pointer itself isn't perfect. If the lock is allocated
inside a DSM segment then it might be mapped at a different address in
different backends. It's safe to compare lock pointers between backends
(assuming !EXEC_BACKEND) if they're in the individual lock tranches or
an extension-requested named tranche, but not necessarily for tranches
in BuiltinTrancheIds or tranches >= LWTRANCHE_FIRST_USER_DEFINED that
were directly assigned with LWLockNewTrancheId().

Still, it's better than nothing; the pointer is stable within a
backend, and usually between backends.
---
 src/backend/storage/lmgr/lwlock.c | 35 +++++++++++++++++++------------
 src/backend/utils/probes.d        | 18 +++++++++-------
 2 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 0a147bddaf..92de34a399 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1323,7 +1323,8 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 #endif
 		LWLockReportWaitStart(lock);
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode, lock,
+										   lock->tranche);
 
 		for (;;)
 		{
@@ -1345,7 +1346,8 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		}
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode, lock,
+										  lock->tranche);
 		LWLockReportWaitEnd();
 
 		LOG_LWDEBUG("LWLockAcquire", lock, "awakened");
@@ -1354,7 +1356,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 		result = false;
 	}
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode, lock, lock->tranche);
 
 	/* Add lock to list of locks held by this backend */
 	held_lwlocks[num_held_lwlocks].lock = lock;
@@ -1405,14 +1407,16 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
 		RESUME_INTERRUPTS();
 
 		LOG_LWDEBUG("LWLockConditionalAcquire", lock, "failed");
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock), mode, lock,
+												 lock->tranche);
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
 		held_lwlocks[num_held_lwlocks].lock = lock;
 		held_lwlocks[num_held_lwlocks++].mode = mode;
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), mode, lock,
+											lock->tranche);
 	}
 	return !mustwait;
 }
@@ -1484,7 +1488,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 #endif
 		LWLockReportWaitStart(lock);
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode, lock,
+										   lock->tranche);
 
 		for (;;)
 		{
@@ -1502,7 +1507,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 			Assert(nwaiters < MAX_BACKENDS);
 		}
 #endif
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode, lock,
+										  lock->tranche);
 		LWLockReportWaitEnd();
 
 		LOG_LWDEBUG("LWLockAcquireOrWait", lock, "awakened");
@@ -1532,7 +1538,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
 		LOG_LWDEBUG("LWLockAcquireOrWait", lock, "failed");
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), mode, lock,
+													 lock->tranche);
 	}
 	else
 	{
@@ -1540,7 +1547,8 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 		/* Add lock to list of locks held by this backend */
 		held_lwlocks[num_held_lwlocks].lock = lock;
 		held_lwlocks[num_held_lwlocks++].mode = mode;
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode);
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), mode, lock,
+												lock->tranche);
 	}
 
 	return !mustwait;
@@ -1700,7 +1708,8 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 #endif
 		LWLockReportWaitStart(lock);
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), LW_EXCLUSIVE);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), LW_EXCLUSIVE, lock,
+										   lock->tranche);
 
 		for (;;)
 		{
@@ -1719,7 +1728,7 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 		}
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), LW_EXCLUSIVE);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), LW_EXCLUSIVE, lock, lock->tranche);
 		LWLockReportWaitEnd();
 
 		LOG_LWDEBUG("LWLockWaitForVar", lock, "awakened");
@@ -1845,6 +1854,8 @@ LWLockRelease(LWLock *lock)
 	/* nobody else can have that kind of lock */
 	Assert(!(oldstate & LW_VAL_EXCLUSIVE));
 
+	/* Released, though not woken yet. All releases must fire this. */
+	TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock), mode, lock, lock->tranche);
 
 	/*
 	 * We're still waiting for backends to get scheduled, don't wake them up
@@ -1868,8 +1879,6 @@ LWLockRelease(LWLock *lock)
 		LWLockWakeup(lock);
 	}
 
-	TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock));
-
 	/*
 	 * Now okay to allow cancel/die interrupts.
 	 */
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index a0b0458108..89805c3a89 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -17,6 +17,7 @@
 #define LocalTransactionId unsigned int
 #define LWLockMode int
 #define LOCKMODE int
+#define LWLock void
 #define BlockNumber unsigned int
 #define Oid unsigned int
 #define ForkNumber int
@@ -28,14 +29,15 @@ provider postgresql {
 	probe transaction__commit(LocalTransactionId);
 	probe transaction__abort(LocalTransactionId);
 
-	probe lwlock__acquire(const char *, LWLockMode);
-	probe lwlock__release(const char *);
-	probe lwlock__wait__start(const char *, LWLockMode);
-	probe lwlock__wait__done(const char *, LWLockMode);
-	probe lwlock__condacquire(const char *, LWLockMode);
-	probe lwlock__condacquire__fail(const char *, LWLockMode);
-	probe lwlock__acquire__or__wait(const char *, LWLockMode);
-	probe lwlock__acquire__or__wait__fail(const char *, LWLockMode);
+	probe lwlock__acquire(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__release(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__wait__start(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__wait__done(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__condacquire(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__condacquire__fail(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__acquire__or__wait(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__acquire__or__wait__fail(const char *, LWLockMode, LWLock*, int);
+
 	probe lock__wait__start(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
 	probe lock__wait__done(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
-- 
2.26.2
From 9969eb12b5a1fdd39cdc7b3b7acb28370e1f95aa Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig.rin...@2ndquadrant.com>
Date: Thu, 19 Nov 2020 18:15:34 +0800
Subject: [PATCH v1 1/5] Remove bogus lwlock__acquire tracepoint from
 LWLockWaitForVar

Calls to LWLockWaitForVar fired the TRACE_POSTGRESQL_LWLOCK_ACQUIRE
tracepoint, but LWLockWaitForVar() never actually acquires the LWLock.
Remove it.
---
 src/backend/storage/lmgr/lwlock.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..0a147bddaf 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1727,8 +1727,6 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 		/* Now loop back and check the status of the lock again. */
 	}
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), LW_EXCLUSIVE);
-
 	/*
 	 * Fix the process wait semaphore's count for any absorbed wakeups.
 	 */
-- 
2.26.2
From 7c697ae06dff02bcd306c55aa6a79eabd58284fc Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig.rin...@2ndquadrant.com>
Date: Thu, 19 Nov 2020 18:05:39 +0800
Subject: [PATCH v1 3/5] Add to the tracepoints in LWLock routines

The existing tracepoints in lwlock.c didn't mark the start of LWLock
acquisition, so timing of the full LWLock acquire cycle wasn't possible
without relying on debuginfo. Since this can be quite relevant for
production performance issues, emit tracepoints at the start of LWLock
acquire.

Also add a tracepoint that's fired for all LWLock acquisitions at the
moment the shared memory state changes, whether done by LWLockAcquire
or LWLockConditionalAcquire. This lets tools reliably track which
backends hold which LWLocks even if we add new functions that acquire
LWLocks in future.

Add tracepoints in LWLockWaitForVar and LWLockUpdateVar so process
interaction around LWLock variable waits can be observed from trace
tooling. They can cause long waits and/or deadlocks, so it's worth
being able to time and track them.
---
 src/backend/storage/lmgr/lwlock.c | 24 ++++++++++++++++++++++++
 src/backend/utils/probes.d        |  8 ++++++++
 2 files changed, 32 insertions(+)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 92de34a399..d63119931b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -876,6 +876,9 @@ LWLockAttemptLock(LWLock *lock, LWLockMode mode)
 			if (mode == LW_EXCLUSIVE)
 				lock->owner = MyProc;
 #endif
+			/* All LWLock acquires must hit this tracepoint */
+			TRACE_POSTGRESQL_LWLOCK_ACQUIRED(T_NAME(lock), mode, lock,
+											 lock->tranche);
 			return false;
 		}
 		else
@@ -1239,6 +1242,9 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
 		elog(ERROR, "too many LWLocks taken");
 
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE_START(T_NAME(lock), mode, lock,
+										  lock->tranche);
+
 	/*
 	 * Lock out cancel/die interrupts until we exit the code section protected
 	 * by the LWLock.  This ensures that interrupts will not interfere with
@@ -1391,6 +1397,9 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
 	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
 		elog(ERROR, "too many LWLocks taken");
 
+	TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_START(T_NAME(lock), mode, lock,
+											  lock->tranche);
+
 	/*
 	 * Lock out cancel/die interrupts until we exit the code section protected
 	 * by the LWLock. This ensures that interrupts will not interfere with
@@ -1455,6 +1464,9 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
 		elog(ERROR, "too many LWLocks taken");
 
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_START(T_NAME(lock), mode, lock,
+												  lock->tranche);
+
 	/*
 	 * Lock out cancel/die interrupts until we exit the code section protected
 	 * by the LWLock. This ensures that interrupts will not interfere with
@@ -1637,6 +1649,9 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 	PRINT_LWDEBUG("LWLockWaitForVar", lock, LW_WAIT_UNTIL_FREE);
 
+	TRACE_POSTGRESQL_LWLOCK_WAITFORVAR_START(T_NAME(lock), lock,
+											 lock->tranche, valptr, oldval, *valptr);
+
 	/*
 	 * Lock out cancel/die interrupts while we sleep on the lock.  There is no
 	 * cleanup mechanism to remove us from the wait queue if we got
@@ -1747,6 +1762,9 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 	 */
 	RESUME_INTERRUPTS();
 
+	TRACE_POSTGRESQL_LWLOCK_WAITFORVAR_DONE(T_NAME(lock), lock, lock->tranche,
+											valptr, oldval, *newval, result);
+
 	return result;
 }
@@ -1769,6 +1787,9 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
 	PRINT_LWDEBUG("LWLockUpdateVar", lock, LW_EXCLUSIVE);
 
+	TRACE_POSTGRESQL_LWLOCK_UPDATEVAR_START(T_NAME(lock), lock, lock->tranche, valptr,
+											val);
+
 	proclist_init(&wakeup);
 
 	LWLockWaitListLock(lock);
@@ -1809,6 +1830,9 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
 		waiter->lwWaiting = false;
 		PGSemaphoreUnlock(waiter->sem);
 	}
+
+	TRACE_POSTGRESQL_LWLOCK_UPDATEVAR_DONE(T_NAME(lock), lock, lock->tranche,
+										   valptr, val);
 }
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 89805c3a89..a62fdf61df 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -29,14 +29,22 @@ provider postgresql {
 	probe transaction__commit(LocalTransactionId);
 	probe transaction__abort(LocalTransactionId);
 
+	probe lwlock__acquired(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__acquire__start(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__acquire(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__release(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__wait__start(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__wait__done(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__condacquire__start(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__condacquire(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__condacquire__fail(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__acquire__or__wait__start(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__acquire__or__wait(const char *, LWLockMode, LWLock*, int);
 	probe lwlock__acquire__or__wait__fail(const char *, LWLockMode, LWLock*, int);
+	probe lwlock__waitforvar__start(const char *, LWLock*, int, uint64, uint64, uint64);
+	probe lwlock__waitforvar__done(const char *, LWLock*, int, uint64, uint64, uint64, bool);
+	probe lwlock__updatevar__start(const char *, LWLock*, int, uint64, uint64);
+	probe lwlock__updatevar__done(const char *, LWLock*, int, uint64, uint64);
 
 	probe lock__wait__start(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE);
-- 
2.26.2