Hi Hackers,
I am reaching out to solicit your insights and comments on this patch, which
addresses a significant performance bottleneck in LWLock acquisition
observed on high-core-count systems. During performance analysis of
HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
system, we found that LWLockAttemptLock consumed 7.12% of total CPU
cycles. This bottleneck becomes even more pronounced (up to 30% of
cycles) after applying lock-free WAL optimizations[1][2].
Problem Analysis:
The current LWLock implementation uses separate atomic operations for
state checking and modification. For shared locks (84% of
LWLockAttemptLock calls in this workload), each acquisition attempt requires:
1. An atomic read of lock->state
2. State modification
3. An atomic compare-exchange (with retries on contention)
This read-then-CAS design (sketched below) causes excessive atomic operations
on contended locks, which are particularly expensive on high-core-count
systems where cache-line bouncing amplifies synchronization costs.
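For reference, the existing shared-lock fast path is essentially the following
read-then-CAS loop (a simplified sketch only: the helper name is illustrative
and it assumes lwlock.c's internal definitions, it is not the exact upstream
code):

    /* Sketch: try to take the lock in LW_SHARED mode; true means "must wait". */
    static bool
    attempt_shared_cas(LWLock *lock)
    {
        /* Read once; failed CASes below refresh old_state for us. */
        uint32      old_state = pg_atomic_read_u32(&lock->state);

        while (true)
        {
            uint32      desired_state;

            if (old_state & LW_VAL_EXCLUSIVE)
                return true;        /* lock not free, caller must wait */

            desired_state = old_state + LW_VAL_SHARED;  /* bump refcount */

            /* On failure, old_state is updated to the current value. */
            if (pg_atomic_compare_exchange_u32(&lock->state,
                                               &old_state, desired_state))
                return false;       /* acquired in shared mode */
        }
    }

Each failed compare-exchange is another atomic round trip on a cache line that
many backends are contending for.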
Optimization Approach:
The patch optimizes shared lock acquisition by:
1. Merging the state read and update into a single atomic add operation (see
the code sketch below)
2. Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
3. Adding a willwait parameter to control when the optimization is used
Key implementation details:
- For LW_SHARED with willwait=true: uses an atomic fetch-add to increment the
shared reference count
- Maintains backward compatibility through state mask adjustments
- Preserves existing behavior for:
1) Exclusive locks
2) Non-waiting cases (LWLockConditionalAcquire)
- Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)
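Concretely, with willwait=true the shared-lock attempt collapses to a single
fetch-add; the snippet below mirrors the attached patch (surrounding code and
instrumentation omitted):

    /*
     * One atomic operation both publishes our shared reference and returns
     * the previous state; a set LW_VAL_EXCLUSIVE bit in old_state means an
     * exclusive holder was present and the caller must wait.
     */
    if (willwait && mode == LW_SHARED)
    {
        old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
        Assert((old_state & LW_LOCK_MASK) != LW_LOCK_MASK);
        return (old_state & LW_VAL_EXCLUSIVE) != 0;   /* true => must wait */
    }

When the exclusive bit was set, the failed attempt's increment is not undone
immediately, which is why the shared count must be bounded by MAX_BACKENDS * 2
and why LW_SHARED_MASK gains an extra bit.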
Performance Impact:
Testing on a 384-vCPU Intel system shows:
- *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
- *46%* cumulative improvement when combined with lock-free WAL
optimizations[1][2]
Patch Contents:
1. Extends the shared mask and shifts the exclusive lock value
2. Adds the willwait parameter to control the optimization
3. Updates the lock acquisition/release logic
4. Maintains all existing assertions and safety checks
The optimization is particularly effective for contended shared locks,
which are common in buffer mapping, lock manager, and shared buffer
access patterns.
Please review this patch for consideration in an upcoming PostgreSQL release.
[1] Lock-free XLog Reservation from WAL:
https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
[2] Increase NUM_XLOGINSERT_LOCKS:
https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
Regards,
Zhiguo
From 7fa7181655627d7ed5eafb33341621d1965c5312 Mon Sep 17 00:00:00 2001
From: Zhiguo Zhou <zhiguo.z...@intel.com>
Date: Thu, 29 May 2025 16:55:42 +0800
Subject: [PATCH] Optimize shared LWLock acquisition for high-core-count
systems
This patch introduces optimizations to reduce lock acquisition overhead in
LWLock by merging the read and update operations for the LW_SHARED lock's
state. This eliminates the need for separate atomic instructions, which is
critical for improving performance on high-core-count systems.
Key changes:
- Extended LW_SHARED_MASK by 1 bit and shifted LW_VAL_EXCLUSIVE by 1 bit to
ensure compatibility with the upper bound of MAX_BACKENDS * 2.
- Added a `willwait` parameter to `LWLockAttemptLock` to disable the
optimization when the caller is unwilling to wait, avoiding conflicts
between the reference count and the LW_VAL_EXCLUSIVE flag.
- Updated `LWLockReleaseInternal` to use `pg_atomic_fetch_and_u32` for
clearing lock state flags atomically.
- Adjusted related functions (`LWLockAcquire`, `LWLockConditionalAcquire`,
`LWLockAcquireOrWait`) to pass the `willwait` parameter appropriately.
These changes improve scalability and reduce contention in workloads with
frequent LWLock operations on servers with many cores.
---
src/backend/storage/lmgr/lwlock.c | 73 ++++++++++++++++++++++++-------
1 file changed, 57 insertions(+), 16 deletions(-)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 46f44bc4511..4c29016ce35 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -97,20 +97,41 @@
#define LW_FLAG_BITS 3
#define LW_FLAG_MASK (((1<<LW_FLAG_BITS)-1)<<(32-LW_FLAG_BITS))
-/* assumes MAX_BACKENDS is a (power of 2) - 1, checked below */
-#define LW_VAL_EXCLUSIVE (MAX_BACKENDS + 1)
+/*
+ * already (power of 2)-1, i.e. suitable for a mask
+ *
+ * Originally, the LW_SHARED lock reference count was maintained in bits
+ * [MAX_BACKEND_BITS-1:0] of LWLock.state, with a theoretical maximum of
+ * MAX_BACKENDS (when all MAX_BACKENDS processes hold the lock concurrently).
+ *
+ * To reduce lock acquisition overhead, we optimized LWLockAttemptLock by
+ * merging the read and update operations for the LW_SHARED lock's state.
+ * This eliminates the need for separate atomic instructions - a critical
+ * improvement given the high cost of atomic operations on high-core-count
+ * systems.
+ *
+ * This optimization means the reference count may be temporarily incremented
+ * even when a reader fails to acquire the lock because it is held exclusively.
+ * However, since each process retries lock acquisition up to *twice* before
+ * waiting on a semaphore, the reference count is bounded by MAX_BACKENDS * 2.
+ *
+ * To ensure compatibility with this upper bound:
+ * 1. LW_SHARED_MASK has been extended by 1 bit
+ * 2. LW_VAL_EXCLUSIVE is left-shifted by 1 bit
+ */
+#define LW_SHARED_MASK ((MAX_BACKENDS << 1) + 1)
+#define LW_VAL_EXCLUSIVE (LW_SHARED_MASK + 1)
+#define LW_LOCK_MASK (LW_SHARED_MASK | LW_VAL_EXCLUSIVE)
#define LW_VAL_SHARED 1
-/* already (power of 2)-1, i.e. suitable for a mask */
-#define LW_SHARED_MASK MAX_BACKENDS
-#define LW_LOCK_MASK (MAX_BACKENDS | LW_VAL_EXCLUSIVE)
+/* assumes MAX_BACKENDS is a (power of 2) - 1, checked below */
StaticAssertDecl(((MAX_BACKENDS + 1) & MAX_BACKENDS) == 0,
"MAX_BACKENDS + 1 needs to be a power of 2");
-StaticAssertDecl((MAX_BACKENDS & LW_FLAG_MASK) == 0,
- "MAX_BACKENDS and LW_FLAG_MASK overlap");
+StaticAssertDecl((LW_SHARED_MASK & LW_FLAG_MASK) == 0,
+ "LW_SHARED_MASK and LW_FLAG_MASK overlap");
StaticAssertDecl((LW_VAL_EXCLUSIVE & LW_FLAG_MASK) == 0,
"LW_VAL_EXCLUSIVE and LW_FLAG_MASK overlap");
@@ -277,6 +298,8 @@ PRINT_LWDEBUG(const char *where, LWLock *lock, LWLockMode mode)
if (Trace_lwlocks)
{
uint32 state = pg_atomic_read_u32(&lock->state);
+ uint32 excl = (state & LW_VAL_EXCLUSIVE) != 0;
+ uint32 shared = excl ? 0 : state & LW_SHARED_MASK;
ereport(LOG,
(errhidestmt(true),
@@ -284,8 +307,8 @@ PRINT_LWDEBUG(const char *where, LWLock *lock, LWLockMode mode)
errmsg_internal("%d: %s(%s %p): excl %u shared %u haswaiters %u waiters %u rOK %d",
MyProcPid,
where, T_NAME(lock), lock,
- (state & LW_VAL_EXCLUSIVE) != 0,
- state & LW_SHARED_MASK,
+ excl,
+ shared,
(state & LW_FLAG_HAS_WAITERS) != 0,
pg_atomic_read_u32(&lock->nwaiters),
(state & LW_FLAG_RELEASE_OK) != 0)));
@@ -790,15 +813,30 @@ GetLWLockIdentifier(uint32 classId, uint16 eventId)
* This function will not block waiting for a lock to become free - that's the
* caller's job.
*
+ * willwait: true if the caller is willing to wait for the lock to become free
+ * false if the caller is not willing to wait.
+ *
* Returns true if the lock isn't free and we need to wait.
*/
static bool
-LWLockAttemptLock(LWLock *lock, LWLockMode mode)
+LWLockAttemptLock(LWLock *lock, LWLockMode mode, bool willwait)
{
uint32 old_state;
Assert(mode == LW_EXCLUSIVE || mode == LW_SHARED);
+ /*
+ * To avoid conflicts between the reference count and the LW_VAL_EXCLUSIVE
+ * flag, this optimization is disabled when willwait is false. See the
+ * detailed comments where LW_SHARED_MASK is defined for more explanation.
+ */
+ if (willwait && mode == LW_SHARED)
+ {
+ old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
+ Assert((old_state & LW_LOCK_MASK) != LW_LOCK_MASK);
+ return (old_state & LW_VAL_EXCLUSIVE) != 0;
+ }
+
/*
* Read once outside the loop, later iterations will get the newer value
* via compare & exchange.
@@ -1242,7 +1280,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
* Try to grab the lock the first time, we're not in the waitqueue
* yet/anymore.
*/
- mustwait = LWLockAttemptLock(lock, mode);
+ mustwait = LWLockAttemptLock(lock, mode, true);
if (!mustwait)
{
@@ -1265,7 +1303,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
LWLockQueueSelf(lock, mode);
/* we're now guaranteed to be woken up if necessary */
- mustwait = LWLockAttemptLock(lock, mode);
+ mustwait = LWLockAttemptLock(lock, mode, true);
/* ok, grabbed the lock the second time round, need to undo queueing */
if (!mustwait)
@@ -1368,7 +1406,7 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
HOLD_INTERRUPTS();
/* Check for the lock */
- mustwait = LWLockAttemptLock(lock, mode);
+ mustwait = LWLockAttemptLock(lock, mode, false);
if (mustwait)
{
@@ -1435,13 +1473,13 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
* NB: We're using nearly the same twice-in-a-row lock acquisition
* protocol as LWLockAcquire(). Check its comments for details.
*/
- mustwait = LWLockAttemptLock(lock, mode);
+ mustwait = LWLockAttemptLock(lock, mode, true);
if (mustwait)
{
LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE);
- mustwait = LWLockAttemptLock(lock, mode);
+ mustwait = LWLockAttemptLock(lock, mode, true);
if (mustwait)
{
@@ -1843,7 +1881,10 @@ LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
* others, even if we still have to wakeup other waiters.
*/
if (mode == LW_EXCLUSIVE)
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
+ {
+ oldstate = pg_atomic_fetch_and_u32(&lock->state, ~LW_LOCK_MASK);
+ oldstate &= ~LW_LOCK_MASK;
+ }
else
oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
--
2.43.0