Hi Andres,

On 7/1/2025 10:06 PM, Andres Freund wrote:
Hi,

On 2025-07-01 09:57:18 -0400, Andres Freund wrote:
On 2025-06-26 13:07:49 +0800, Zhou, Zhiguo wrote:
This patch addresses severe LWLock contention observed on high-core systems
where hundreds of processors concurrently access frequently-shared locks.
Specifically for ProcArrayLock (exhibiting 93.5% shared-mode acquires), we
implement a new ReadBiasedLWLock mechanism to eliminate the atomic operation
bottleneck.

Key aspects:
1. Problem: Previous optimizations[1] left LWLockAttemptLock/Release
consuming
    ~25% total CPU cycles on 384-vCPU systems due to contention on a single
    lock-state cache line. Shared lock attempts showed 37x higher cumulative
    latency than exclusive mode for ProcArrayLock.

2. Solution: ReadBiasedLWLock partitions lock state across 16 cache lines
    (READ_BIASED_LOCK_STATE_COUNT):
    - Readers acquire/release only their designated LWLock (indexed by
      pid % 16) using a single atomic operation
    - Writers pay higher cost by acquiring all 16 sub-locks exclusively
    - Maintains LWLock's "acquiring process must release" semantics

3. Performance: HammerDB/TPCC shows 35.3% NOPM improvement over baseline
    - Lock acquisition CPU cycles reduced from 16.7% to 7.4%
    - Lock release cycles reduced from 7.9% to 2.2%

4. Implementation:
    - Core infrastructure for ReadBiasedLWLock
    - ProcArrayLock converted as proof-of-concept
    - Maintains full LWLock API compatibility

Known considerations:
- Increased writer acquisition cost (acceptable given rarity of exclusive
   acquisitions for biased locks like ProcArrayLock)

Unfortunately I have a very hard time believing that that's unacceptable -
there are plenty workloads (many write intensive ones) where exclusive locks
on ProcArrayLock are the bottleneck.

Ooops, s/unacceptable/acceptable/

Greetings,

Andres Freund

Thank you for raising this important concern about potential impacts on write-intensive workloads. You're absolutely right to question whether increased exclusive acquisition costs are acceptable. To address your point:

1. We acknowledge this is an aggressive optimization with inherent trade-offs. While it addresses severe shared-acquisition bottlenecks (particularly relevant for large-core systems), we fully recognize that the increased exclusive acquisition cost could be problematic for write-heavy scenarios. Our position is that for locks with highly skewed access patterns like ProcArrayLock (where 93.5% of acquisitions are shared), this trade-off may be worthwhile.
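
To make the trade-off concrete, here is a minimal sketch of the mechanism summarized above; the names and layout are illustrative only and do not reproduce the patch's actual definitions:

/*
 * Sketch only: one LWLock per cache line (LWLockPadded), so readers that
 * hash to different slots never touch the same lock-state cache line.
 */
#define READ_BIASED_LOCK_STATE_COUNT 16

typedef struct ReadBiasedLWLock
{
	LWLockPadded locks[READ_BIASED_LOCK_STATE_COUNT];
} ReadBiasedLWLock;

static inline void
ReadBiasedLWLockSharedAcquire(ReadBiasedLWLock *lock)
{
	/* a reader touches only its own sub-lock: one atomic, one cache line */
	LWLockAcquire(&lock->locks[MyProcPid % READ_BIASED_LOCK_STATE_COUNT].lock,
				  LW_SHARED);
}

static inline void
ReadBiasedLWLockSharedRelease(ReadBiasedLWLock *lock)
{
	/* the same backend computes the same index, so the acquiring process
	 * releases the same sub-lock, preserving LWLock semantics */
	LWLockRelease(&lock->locks[MyProcPid % READ_BIASED_LOCK_STATE_COUNT].lock);
}

static inline void
ReadBiasedLWLockExclusiveAcquire(ReadBiasedLWLock *lock)
{
	/* a writer must take all 16 sub-locks, paying 16 atomics instead of 1 */
	for (int i = 0; i < READ_BIASED_LOCK_STATE_COUNT; i++)
		LWLockAcquire(&lock->locks[i].lock, LW_EXCLUSIVE);
}

The asymmetry you are pointing at is visible directly here: the exclusive path touches 16 lock-state cache lines where a plain LWLock touches one, which is precisely why the access pattern of the lock matters so much.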

2. Our focus on HammerDB/TPCC stems from customer workloads where ProcArrayLock contention in shared mode is demonstrably the dominant bottleneck. Our profiling shows:
- 37x higher cumulative latency for shared acquires vs exclusive
- 16.7% of CPU cycles consumed by lock acquisition pre-optimization

This suggests that for this specific lock in OLTP contexts, mitigating shared contention is critical.

3. To better understand the scenarios where this trade-off might be unfavorable, could you share the specific write-intensive workloads you're concerned about? We would prioritize evaluating this patch against:
a) Benchmarks known to stress ProcArrayLock in exclusive mode
b) Production-like workloads where you anticipate exclusive acquisitions might become problematic

We're committed to testing this rigorously and exploring mitigation strategies if needed.

4. Before implementing ReadBiasedLWLock, we explored a less invasive alternative in [1]. That approach maintained the existing exclusive lock path while optimizing shared acquisitions to use a single atomic operation. Would you be willing to review that approach first? We believe it might offer a more balanced solution while still addressing the core contention issue identified in TPCC.
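
For the sake of discussion, the rough shape of that idea is sketched below; this is not the actual code from [1], the rollback handling is simplified, and LW_VAL_SHARED / LW_VAL_EXCLUSIVE are the existing lwlock.c constants:

static bool
LWLockAttemptLockShared(LWLock *lock)
{
	uint32		old_state;

	/*
	 * Optimistically add a shared reference with one fetch-add rather than
	 * a compare-exchange retry loop.
	 */
	old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);

	if (old_state & LW_VAL_EXCLUSIVE)
	{
		/* a writer already holds the lock: undo and take the slow path */
		pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
		return false;
	}

	return true;
}

The relevant point for this thread is that the exclusive path is left untouched, so writers pay no additional cost.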

We appreciate your expertise here and want to ensure we don't simply shift the bottleneck from readers to writers. Your guidance on suitable stress tests for exclusive acquisition overhead would be invaluable as we plan our next steps.

[1] Optimize shared LWLock acquisition for high-core-count systems: https://www.postgresql.org/message-id/flat/73d53acf-4f66-41df-b438-5c2e6115d4de%40intel.com

Regards,
Zhiguo

