Hi Hackers,
I am reaching out to solicit your insights and comments on this patch, which
addresses a significant performance bottleneck in LWLock acquisition
observed on high-core-count systems. During performance analysis of
HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
system, we found that LWLockAttemptLock consumed 7.12% of total CPU
cycles. This bottleneck becomes even more pronounced (up to 30% of
cycles) after applying lock-free WAL optimizations[1][2].
Problem Analysis:
The current LWLock implementation uses separate atomic operations for
state checking and modification. For shared locks (84% of
LWLockAttemptLock calls in this workload), each acquisition attempt requires:
1. An atomic read of lock->state
2. State modification
3. An atomic compare-exchange (with retries on contention)
This read-then-CAS design (sketched below) causes excessive atomic operations
on contended locks, which are particularly expensive on high-core-count
systems where cache-line bouncing amplifies synchronization costs.
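For reference, the existing shared-lock fast path is essentially the following
read-then-CAS loop (a simplified sketch only: the helper name is illustrative
and it assumes lwlock.c's internal definitions, it is not the exact upstream
code):

    /* Sketch: try to take the lock in LW_SHARED mode; true means "must wait". */
    static bool
    attempt_shared_cas(LWLock *lock)
    {
        /* Read once; failed CASes below refresh old_state for us. */
        uint32      old_state = pg_atomic_read_u32(&lock->state);

        while (true)
        {
            uint32      desired_state;

            if (old_state & LW_VAL_EXCLUSIVE)
                return true;        /* lock not free, caller must wait */

            desired_state = old_state + LW_VAL_SHARED;  /* bump refcount */

            /* On failure, old_state is updated to the current value. */
            if (pg_atomic_compare_exchange_u32(&lock->state,
                                               &old_state, desired_state))
                return false;       /* acquired in shared mode */
        }
    }

Each failed compare-exchange is another atomic round trip on a cache line that
many backends are contending for.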
Optimization Approach:
The patch optimizes shared lock acquisition by:
1. Merging the state read and update into a single atomic add operation (see
the code sketch below)
2. Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
3. Adding a willwait parameter to control when the optimization is used
Key implementation details:
- For LW_SHARED with willwait=true: uses an atomic fetch-add to increment the
shared reference count
- Maintains backward compatibility through state mask adjustments
- Preserves existing behavior for:
1) Exclusive locks
2) Non-waiting cases (LWLockConditionalAcquire)
- Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)
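Concretely, with willwait=true the shared-lock attempt collapses to a single
fetch-add; the snippet below mirrors the attached patch (surrounding code and
instrumentation omitted):

    /*
     * One atomic operation both publishes our shared reference and returns
     * the previous state; a set LW_VAL_EXCLUSIVE bit in old_state means an
     * exclusive holder was present and the caller must wait.
     */
    if (willwait && mode == LW_SHARED)
    {
        old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
        Assert((old_state & LW_LOCK_MASK) != LW_LOCK_MASK);
        return (old_state & LW_VAL_EXCLUSIVE) != 0;   /* true => must wait */
    }

When the exclusive bit was set, the failed attempt's increment is not undone
immediately, which is why the shared count must be bounded by MAX_BACKENDS * 2
and why LW_SHARED_MASK gains an extra bit.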
Performance Impact:
Testing on a 384-vCPU Intel system shows:
- *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
- *46%* cumulative improvement when combined with lock-free WAL
optimizations[1][2]
Patch Contents:
1. Extends the shared mask and shifts the exclusive lock value
2. Adds the willwait parameter to control the optimization
3. Updates the lock acquisition/release logic
4. Maintains all existing assertions and safety checks
The optimization is particularly effective for contended shared locks,
which are common in buffer mapping, lock manager, and shared buffer
access patterns.
Please review this patch for consideration in an upcoming PostgreSQL release.
[1] Lock-free XLog Reservation from WAL:
https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
[2] Increase NUM_XLOGINSERT_LOCKS:
https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
Regards,
Zhiguo
From 7fa7181655627d7ed5eafb33341621d1965c5312 Mon Sep 17 00:00:00 2001
From: Zhiguo Zhou <zhiguo.z...@intel.com>
Date: Thu, 29 May 2025 16:55:42 +0800
Subject: [PATCH] Optimize shared LWLock acquisition for high-core-count
systems
This patch introduces optimizations to reduce lock acquisition overhead in
LWLock by merging the read and update operations for the LW_SHARED lock's
state. This eliminates the need for separate atomic instructions, which is
critical for improving performance on high-core-count systems.
Key changes:
- Extended LW_SHARED_MASK by 1 bit and shifted LW_VAL_EXCLUSIVE by 1 bit to
ensure compatibility with the upper bound of MAX_BACKENDS * 2.
- Added a `willwait` parameter to `LWLockAttemptLock` to disable the
optimization when the caller is unwilling to wait, avoiding conflicts
between the reference count and the LW_VAL_EXCLUSIVE flag.
- Updated `LWLockReleaseInternal` to use `pg_atomic_fetch_and_u32` for
clearing lock state flags atomically.
- Adjusted related functions (`LWLockAcquire`, `LWLockConditionalAcquire`,
`LWLockAcquireOrWait`) to pass the `willwait` parameter appropriately.
These changes improve scalability and reduce contention in workloads with
frequent LWLock operations on servers with many cores.
---
src/backend/storage/lmgr/lwlock.c | 73 ++++++++++++++++++++++++-------
1 file changed, 57 insertions(+), 16 deletions(-)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 46f44bc4511..4c29016ce35 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -97,20 +97,41 @@
#define LW_FLAG_BITS 3
#define LW_FLAG_MASK (((1<<LW_FLAG_BITS)-1)<<(32-LW_FLAG_BITS))
-/* assumes MAX_BACKENDS is a (power of 2) - 1, checked below */
-#define LW_VAL_EXCLUSIVE (MAX_BACKENDS + 1)
+/*
+ * already (power of 2)-1, i.e. suitable for a mask
+ *
+ * Originally, the LW_SHARED lock reference count was maintained in bits
+ * [MAX_BACKEND_BITS-1:0] of LWLock.state, with a theoretical maximum of
+ * MAX_BACKENDS (when all MAX_BACKENDS processes hold the lock concurrently).
+ *
+ * To reduce lock acquisition overhead, we optimized LWLockAttemptLock by
+ * merging the read and update operations for the LW_SHARED lock's state.
+ * This eliminates the need for separate atomic instructions - a critical
+ * improvement given the high cost of atomic operations on high-core-count
+ * systems.
+ *
+ * This optimization means the reference count may be temporarily incremented
+ * even when a reader fails to acquire the lock because it is held exclusively.
+ * However, since each process retries lock acquisition up to *twice* before
+ * waiting on a semaphore, the reference count is bounded by MAX_BACKENDS * 2.
+ *
+ * To ensure compatibility with this upper bound:
+ * 1. LW_SHARED_MASK has been extended by 1 bit
+ * 2. LW_VAL_EXCLUSIVE is left-shifted by 1 bit
+ */
+#define LW_SHARED_MASK ((MAX_BACKENDS << 1) + 1)
+#define LW_VAL_EXCLUSIVE (LW_SHARED_MASK + 1)
+#define LW_LOCK_MASK (LW_SHARED_MASK | LW_VAL_EXCLUSIVE)
#define LW_VAL_SHARED 1
-/* already (power of 2)-1, i.e. suitable for a mask */
-#define LW_SHARED_MASK MAX_BACKENDS
-#define LW_LOCK_MASK (MAX_BACKENDS | LW_VAL_EXCLUSIVE)
+/* assumes MAX_BACKENDS is a (power of 2) - 1, checked below */
StaticAssertDecl(((MAX_BACKENDS + 1) & MAX_BACKENDS) == 0,
"MAX_BACKENDS + 1 needs to be a power of 2");
-StaticAssertDecl((MAX_BACKENDS & LW_FLAG_MASK) == 0,
- "MAX_BACKENDS and LW_FLAG_MASK overlap");
+StaticAssertDecl((LW_SHARED_MASK & LW_FLAG_MASK) == 0,
+ "LW_SHARED_MASK and LW_FLAG_MASK overlap");
StaticAssertDecl((LW_VAL_EXCLUSIVE & LW_FLAG_MASK) == 0,
"LW_VAL_EXCLUSIVE and LW_FLAG_MASK overlap");
@@ -277,6 +298,8 @@ PRINT_LWDEBUG(const char *where, LWLock *lock, LWLockMode mode)
if (Trace_lwlocks)
{
uint32 state = pg_atomic_read_u32(&lock->state);
+ uint32 excl = (state & LW_VAL_EXCLUSIVE) != 0;
+ uint32 shared = excl ? 0 : state & LW_SHARED_MASK;
ereport(LOG,
(errhidestmt(true),
@@ -284,8 +307,8 @@ PRINT_LWDEBUG(const char *where, LWLock *lock, LWLockMode mode)
errmsg_internal("%d: %s(%s %p): excl %u shared %u haswaiters %u waiters %u rOK %d",
MyProcPid,
where, T_NAME(lock), lock,
- (state & LW_VAL_EXCLUSIVE) != 0,
- state & LW_SHARED_MASK,
+ excl,
+ shared,
(state & LW_FLAG_HAS_WAITERS) != 0,
pg_atomic_read_u32(&lock->nwaiters),
(state & LW_FLAG_RELEASE_OK) != 0)));
@@ -790,15 +813,30 @@ GetLWLockIdentifier(uint32 classId, uint16 eventId)
* This function will not block waiting for a lock to become free - that's the
* caller's job.
*
+ * willwait: true if the caller is willing to wait for the lock to become free
+ * false if the caller is not willing to wait.
+ *
* Returns true if the lock isn't free and we need to wait.
*/
static bool
-LWLockAttemptLock(LWLock *lock, LWLockMode mode)
+LWLockAttemptLock(LWLock *lock, LWLockMode mode, bool willwait)
{
uint32 old_state;
Assert(mode == LW_EXCLUSIVE || mode == LW_SHARED);
+ /*
+ * To avoid conflicts between the reference count and the LW_VAL_EXCLUSIVE
+ * flag, this optimization is disabled when willwait is false. See the
+ * detailed comments where LW_SHARED_MASK is defined for more explanation.
+ */
+ if (willwait && mode == LW_SHARED)
+ {
+ old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
+ Assert((old_state & LW_LOCK_MASK) != LW_LOCK_MASK);
+ return (old_state & LW_VAL_EXCLUSIVE) != 0;
+ }
+
/*
* Read once outside the loop, later iterations will get the newer value
* via compare & exchange.
@@ -1242,7 +1280,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
* Try to grab the lock the first time, we're not in the waitqueue
* yet/anymore.
*/
- mustwait = LWLockAttemptLock(lock, mode);
+ mustwait = LWLockAttemptLock(lock, mode, true);
if (!mustwait)
{
@@ -1265,7 +1303,7 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
LWLockQueueSelf(lock, mode);
/* we're now guaranteed to be woken up if necessary */
- mustwait = LWLockAttemptLock(lock, mode);
+ mustwait = LWLockAttemptLock(lock, mode, true);
/* ok, grabbed the lock the second time round, need to undo queueing */
if (!mustwait)
@@ -1368,7 +1406,7 @@ LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
HOLD_INTERRUPTS();
/* Check for the lock */
- mustwait = LWLockAttemptLock(lock, mode);
+ mustwait = LWLockAttemptLock(lock, mode, false);
if (mustwait)
{
@@ -1435,13 +1473,13 @@ LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
* NB: We're using nearly the same twice-in-a-row lock acquisition
* protocol as LWLockAcquire(). Check its comments for details.
*/
- mustwait = LWLockAttemptLock(lock, mode);
+ mustwait = LWLockAttemptLock(lock, mode, true);
if (mustwait)
{
LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE);
- mustwait = LWLockAttemptLock(lock, mode);
+ mustwait = LWLockAttemptLock(lock, mode, true);
if (mustwait)
{
@@ -1843,7 +1881,10 @@ LWLockReleaseInternal(LWLock *lock, LWLockMode mode)
* others, even if we still have to wakeup other waiters.
*/
if (mode == LW_EXCLUSIVE)
- oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
+ {
+ oldstate = pg_atomic_fetch_and_u32(&lock->state, ~LW_LOCK_MASK);
+ oldstate &= ~LW_LOCK_MASK;
+ }
else
oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
--
2.43.0