On 7/9/2025 3:56 PM, Yura Sokolov wrote:
On 30.05.2025 14:30, Zhou, Zhiguo wrote:
Hi Hackers,
I am reaching out to solicit your insights and comments on this patch
addressing a significant performance bottleneck in LWLock acquisition
observed on high-core-count systems. During performance analysis of
HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
system, we found that LWLockAttemptLock consumed 7.12% of total CPU
cycles. This bottleneck becomes even more pronounced (up to 30% of
cycles) after applying lock-free WAL optimizations[1][2].
Problem Analysis:
The current LWLock implementation uses separate atomic operations for
state checking and modification. For shared locks (84% of
LWLockAttemptLock calls), this requires:
1. Atomic read of lock->state
2. State modification
3. Atomic compare-exchange (with retries on contention)
This design causes excessive atomic operations on contended locks, which
are particularly expensive on high-core-count systems where cache-line
bouncing amplifies synchronization costs.
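For illustration, here is a minimal sketch of this read-then-CAS pattern
(simplified from the shared path of LWLockAttemptLock; the function name is
invented and lwlock.c's private LW_VAL_* constants are assumed to be in
scope):

static bool
shared_attempt_cas_sketch(LWLock *lock)
{
    uint32      old_state = pg_atomic_read_u32(&lock->state);

    for (;;)
    {
        uint32      desired_state;

        /* An exclusive holder is present: the shared attempt fails. */
        if (old_state & LW_VAL_EXCLUSIVE)
            return false;

        desired_state = old_state + LW_VAL_SHARED;

        /*
         * Each failed compare-exchange refreshes old_state and costs another
         * contended atomic, which is what dominates the profiles above.
         */
        if (pg_atomic_compare_exchange_u32(&lock->state,
                                           &old_state, desired_state))
            return true;
    }
}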
Optimization Approach:
The patch optimizes shared lock acquisition by:
1. Merging state read and update into a single atomic add operation
2. Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
3. Adding a willwait parameter to control optimization usage
Key implementation details:
- For LW_SHARED with willwait=true: uses an atomic fetch-add to increment
the shared reference count (a sketch follows this list)
- Maintains backward compatibility through state mask adjustments
- Preserves existing behavior for:
1) Exclusive locks
2) Non-waiting cases (LWLockConditionalAcquire)
- Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)
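As a rough illustration (not the patch itself; the function name is invented,
lwlock.c's private constants are assumed to be in scope, and LW_SHARED_MASK is
assumed to be widened as described above), the willwait=true shared path could
be sketched as:

static bool
shared_attempt_fetchadd_sketch(LWLock *lock)
{
    uint32      old_state;

    /*
     * A single atomic publishes our shared reference and returns the
     * previous state, replacing the separate read and compare-exchange
     * (and its retries) with one instruction.
     */
    old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);

    if (old_state & LW_VAL_EXCLUSIVE)
    {
        /*
         * An exclusive holder is present, so the attempt fails.  The
         * reference just added remains as a transient overcount; the widened
         * LW_SHARED_MASK keeps at most MAX_BACKENDS * 2 such increments from
         * reaching the shifted exclusive bit, and the exclusive release is
         * expected to clear the shared bits again.
         */
        return false;
    }

    /* Shared lock acquired with one atomic instruction. */
    return true;
}

The willwait=false callers (e.g. LWLockConditionalAcquire) keep the existing
CAS-based behavior, as noted above.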
Performance Impact:
Testing on a 384-vCPU Intel system shows:
- *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
- *46%* cumulative improvement when combined with lock-free WAL
optimizations[1][2]
Patch Contents:
1. Extends shared mask and shifts exclusive lock value
2. Adds willwait parameter to control optimization
3. Updates lock acquisition/release logic
4. Maintains all existing assertions and safety checks
The optimization is particularly effective for contended shared locks,
which are common in buffer mapping, lock manager, and shared buffer
access patterns.
Please review this patch for consideration in upcoming PostgreSQL releases.
[1] Lock-free XLog Reservation from WAL:
https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
[2] Increase NUM_XLOGINSERT_LOCKS:
https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
Good day, Zhou.
Could you explain why your patch is correct?
As far as I understand, it is clearly not correct at this time:
- The SHARED lock count may be incremented many times, because of the for(;;)
loop in LWLockAcquire and because LWLockAttemptLock is called twice per loop
iteration when the lock is held in Exclusive mode by someone else.
If your patch is correct, then where am I wrong?
When I tried to do the same thing, I did a sub_fetch immediately on
acquisition failure, and did no add_fetch at all if the lock was held in
Exclusive mode.
BTW, there's a way to optimize the EXCLUSIVE lock as well, since there's no
need to do a CAS if the lock is already held by someone else.
See my version in the attachment...
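A rough sketch of this check-before-CAS idea for exclusive acquisition
(illustrative only, not the attached patch; it assumes lwlock.c's private
constants are in scope):

static bool
exclusive_attempt_peek_sketch(LWLock *lock)
{
    uint32      old_state = pg_atomic_read_u32(&lock->state);

    /*
     * If the lock is already held (shared or exclusive), an exclusive
     * attempt cannot succeed, so return without issuing a CAS at all.
     */
    if (old_state & LW_LOCK_MASK)
        return false;

    /* The lock looked free: try to install the exclusive bit. */
    return pg_atomic_compare_exchange_u32(&lock->state, &old_state,
                                          old_state | LW_VAL_EXCLUSIVE);
}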
Good day, Yura.
Thanks for your comments, which identify this critical safety condition
– you're absolutely right that we must guarantee the shared reference
count never overflows into the exclusive bit. Let me clarify the safety
mechanism:
When a reader encounters an exclusive lock during acquisition
(triggering the for(;;) loop), it does temporarily increment the shared
count twice – once per LWLockAttemptLock attempt. However, both
increments occur before the reader waits on its semaphore
(PGSemaphoreLock(proc->sem)). Crucially, when the exclusive holder
releases the lock via LWLockReleaseInternal, it resets the entire lock
state (line 1883: pg_atomic_fetch_and_u32(&lock->state,
~LW_LOCK_MASK)). This clears all reader references, including any
"overcounted" increments from blocked readers.
Thus, when blocked readers wake:
1. They retry acquisition on a zero-initialized state
2. Each ultimately increments only once for successful acquisition
3. The transient "overcount" (≤ MAX_BACKENDS × 2) stays safely within
LW_SHARED_MASK
The key invariants are:
- LW_SHARED_MASK = (MAX_BACKENDS << 1) + 1
- Exclusive release resets all shared bits
- Readers never persist >1 reference after wakeup
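A minimal sketch of the release step these invariants rely on, mirroring the
pg_atomic_fetch_and_u32 call quoted above (assuming LW_LOCK_MASK covers the
widened shared count plus the shifted exclusive bit, with the LW_FLAG_* bits
outside it):

static void
exclusive_release_reset_sketch(LWLock *lock)
{
    uint32      old_state;

    /*
     * Clear every lock-state bit in one atomic AND, leaving the flag bits
     * intact.  This also wipes any transient reader increments that
     * accumulated while the exclusive lock was held, so woken readers retry
     * against a cleared shared count.
     */
    old_state = pg_atomic_fetch_and_u32(&lock->state, ~LW_LOCK_MASK);
    Assert(old_state & LW_VAL_EXCLUSIVE);   /* caller held it exclusively */
}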
Does this resolve the concern? I appreciate you flagging this subtlety –
please correct me if I've misunderstood your scenario or misinterpreted
the code.
And thank you for sharing your implementation – I particularly agree
with your optimization for exclusive lock acquisition. Avoiding the CAS
loop when the lock is already held (by checking the state early) is a
clever reduction of atomic operations, which we know are costly on
high-core-count systems. I’ll prioritize evaluating this for our
HammerDB/TPROC-C workload and share benchmark results soon.
Regarding shared locks: Your version (using sub_fetch on acquisition
failure) does align more cleanly with the original state machine by
avoiding transient overcounts. I initially came up with a similar
approach but shifted to the single-atomic-increment design to minimize
atomic instructions – a critical priority for our 384-core benchmarks
where atomic ops dominate contention.
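For comparison, a sketch of the immediate-rollback variant described above
(illustrative only; the function name is invented and lwlock.c's private
constants are assumed to be in scope):

static bool
shared_attempt_rollback_sketch(LWLock *lock)
{
    uint32      old_state = pg_atomic_fetch_add_u32(&lock->state,
                                                    LW_VAL_SHARED);

    if (old_state & LW_VAL_EXCLUSIVE)
    {
        /*
         * Undo the increment right away instead of leaving a transient
         * overcount for the exclusive release to wipe.  This costs a second
         * atomic on the failure path but keeps the original state machine
         * unchanged.
         */
        pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
        return false;
    }

    return true;
}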
Let’s reconcile these strengths:
1. I’ll test your patch head-to-head against our current version in HCC
TPROC-C workloads.
2. If the atomic savings in your exclusive path yield meaningful gains,
we will try to integrate it into our patch immediately.
3. For shared locks: if your design shows comparable performance while
simplifying correctness, it’s a compelling option.
Really appreciate you driving this optimization further!
Regards,
Zhiguo