Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

Andres Freund Wed, 18 Oct 2017 16:30:57 -0700

Hi,

On 2017-06-05 16:22:58 +0300, Sokolov Yura wrote:
> Patch changes the way LWLock is acquired.
> 
> Before patch, lock is acquired using following procedure:
> 1. first LWLock->state is checked for ability to take a lock, and CAS
>   performed (either lock could be acquired or not). And it is retried if
>   status were changed.
> 2. if lock is not acquired at first 1, Proc is put into wait list, using
>   LW_FLAG_LOCKED bit of LWLock->state as a spin-lock for wait list.
>   In fact, profile shows that LWLockWaitListLock spends a lot of CPU on
>   contendend LWLock (upto 20%).
>   Additional CAS loop is inside pg_atomic_fetch_or_u32 for setting
>   LW_FLAG_HAS_WAITERS. And releasing wait list lock is another CAS loop
>   on the same LWLock->state.
> 3. after that state is checked again, because lock could be released
>   before Proc were queued into wait list.
> 4. if lock were acquired at step 3, Proc should be dequeued from wait
>   list (another spin lock and CAS loop). And this operation could be
>   quite heavy, because whole list should be traversed.
> 
> Patch makes lock acquiring in single CAS loop:
> 1. LWLock->state is read, and ability for lock acquiring is detected.
>   If there is possibility to take a lock, CAS tried.
>   If CAS were successful, lock is aquired (same to original version).
> 2. but if lock is currently held by other backend, we check ability for
>   taking WaitList lock. If wait list lock is not help by anyone, CAS
>   perfomed for taking WaitList lock and set LW_FLAG_HAS_WAITERS at once.
>   If CAS were successful, then LWLock were still held at the moment wait
>   list lock were held - this proves correctness of new algorithm. And
>   Proc is queued to wait list then.
> 3. Otherwise spin_delay is performed, and loop returns to step 1.


Right, something like this didn't use to be possible because we'd both
the LWLock->state and LWLock->mutex. But after 008608b9d5106 that's not
a concern anymore.

> Algorithm for LWLockWaitForVar is also refactored. New version is:
> 1. If lock is not held by anyone, it immediately exit.
> 2. Otherwise it is checked for ability to take WaitList lock, because
>   variable is protected with it. If so, CAS is performed, and if it is
>   successful, loop breaked to step 4.
> 3. Otherwise spin_delay perfomed, and loop returns to step 1.
> 4. var is checked for change.
>   If it were changed, we unlock wait list lock and exit.
>   Note: it could not change in following steps because we are holding
>   wait list lock.
> 5. Otherwise CAS on setting necessary flags is performed. If it succeed,
>   then queue Proc to wait list and exit.
> 6. If Case failed, then there is possibility for LWLock to be already
>   released - if so then we should unlock wait list and exit.
> 7. Otherwise loop returns to step 5.
> 
> So invariant is preserved:
> - either we found LWLock free,
> - or we found changed variable,
> - or we set flags and queue self while LWLock were held.
> 
> Spin_delay is not performed at step 7, because we should release wait
> list lock as soon as possible.

That seems unconvincing - by not delaying you're more likely to
*increase* the time till the current locker that holds the lock can
release the lock.

> Additionally, taking wait list lock is skipped in both algorithms if
> SpinDelayStatus.spins is less than part of spins_per_delay loop
> iterations (so, it is initial iterations, and some iterations after
> pg_usleep, because SpinDelayStatus.spins is reset after sleep).

I quite strongly think it's a good idea to change this at the same time
as the other changes you're proposing here.  I think it's a worthwhile
thing to pursue, but that change also has quit ethe potential for
regressing in some cases.

- Andres


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

Reply via email to