Hi,

On 2017-09-08 22:35:39 +0300, Sokolov Yura wrote:
> From 722a8bed17f82738135264212dde7170b8c0f397 Mon Sep 17 00:00:00 2001
> From: Sokolov Yura <funny.fal...@postgrespro.ru>
> Date: Mon, 29 May 2017 09:25:41 +0000
> Subject: [PATCH 1/6] More correct use of spinlock inside LWLockWaitListLock.
>
> SpinDelayStatus should be function global to count iterations correctly,
> and produce more correct delays.
>
> Also if spin didn't spin, do not count it in spins_per_delay correction.
> It wasn't necessary before cause delayStatus were used only in contented
> cases.

I'm not convinced this is beneficial. It adds a bit of overhead to the
case where LW_FLAG_LOCKED can be set immediately. OTOH, it's not super
likely to make a large difference if the lock has to be taken anyway...

> +
> +/* just shortcut to not declare lwlock_stats variable at the end of function */
> +static void
> +add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus)
> +{
> +	lwlock_stats *lwstats;
> +
> +	lwstats = get_lwlock_stats_entry(lock);
> +	lwstats->spin_delay_count += delayStatus->delays;
> +}
>  #endif							/* LWLOCK_STATS */

This seems like a pretty independent change.

>  /*
>   * Internal function that tries to atomically acquire the lwlock in the passed
> - * in mode.
> + * in mode. If it could not grab the lock, it doesn't puts proc into wait
> + * queue.
>   *
> - * This function will not block waiting for a lock to become free - that's the
> - * callers job.
> + * It is used only in LWLockConditionalAcquire.
>   *
> - * Returns true if the lock isn't free and we need to wait.
> + * Returns true if the lock isn't free.
>   */
>  static bool
> -LWLockAttemptLock(LWLock *lock, LWLockMode mode)
> +LWLockAttemptLockOnce(LWLock *lock, LWLockMode mode)

This has now become a fairly special-cased function; I'm not convinced it
makes much sense with the current naming and functionality.

> +/*
> + * Internal function that tries to atomically acquire the lwlock in the passed
> + * in mode or put it self into waiting queue with waitmode.
> + * This function will not block waiting for a lock to become free - that's the
> + * callers job.
> + *
> + * Returns true if the lock isn't free and we are in a wait queue.
> + */
> +static inline bool
> +LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode)
> +{
> +	uint32		old_state;
> +	SpinDelayStatus delayStatus;
> +	bool		lock_free;
> +	uint32		mask,
> +				add;
> +
> +	AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
> +
> +	if (mode == LW_EXCLUSIVE)
> +	{
> +		mask = LW_LOCK_MASK;
> +		add = LW_VAL_EXCLUSIVE;
> +	}
> +	else
> +	{
> +		mask = LW_VAL_EXCLUSIVE;
> +		add = LW_VAL_SHARED;
> +	}
> +
> +	init_local_spin_delay(&delayStatus);

The way you moved this around has the disadvantage that we now do this -
a number of writes - even in the very common case where the lwlock can be
acquired directly.
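FWIW, that could be avoided by initializing the delay state lazily, only
once we actually fall into the spin branch. Untested sketch, names just
for illustration, the rest of the function as in your patch:

	SpinDelayStatus delayStatus;
	bool		delay_initialized = false;

	/* ... mask/add setup and the CAS loop exactly as in the patch ... */

		else
		{
			/* contended: only now pay for the spin-delay bookkeeping */
			if (!delay_initialized)
			{
				delay_initialized = true;
				init_local_spin_delay(&delayStatus);
			}
			perform_spin_delay(&delayStatus);
			old_state = pg_atomic_read_u32(&lock->state);
		}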
> +	/*
> +	 * Read once outside the loop. Later iterations will get the newer value
> +	 * either via compare & exchange or with read after perform_spin_delay.
> +	 */
> +	old_state = pg_atomic_read_u32(&lock->state);
> +	/* loop until we've determined whether we could acquire the lock or not */
> +	while (true)
> +	{
> +		uint32		desired_state;
> +
> +		desired_state = old_state;
> +
> +		lock_free = (old_state & mask) == 0;
> +		if (lock_free)
>  		{
> -			if (lock_free)
> +			desired_state += add;
> +			if (pg_atomic_compare_exchange_u32(&lock->state,
> +											   &old_state, desired_state))
>  			{
>  				/* Great! Got the lock. */
>  #ifdef LOCK_DEBUG
>  				if (mode == LW_EXCLUSIVE)
>  					lock->owner = MyProc;
>  #endif
> -				return false;
> +				break;
> +			}
> +		}
> +		else if ((old_state & LW_FLAG_LOCKED) == 0)
> +		{
> +			desired_state |= LW_FLAG_LOCKED | LW_FLAG_HAS_WAITERS;
> +			if (pg_atomic_compare_exchange_u32(&lock->state,
> +											   &old_state, desired_state))
> +			{
> +				LWLockQueueSelfLocked(lock, waitmode);
> +				break;
>  			}
> -			else
> -				return true;	/* somebody else has the lock */
> +		}
> +		else
> +		{
> +			perform_spin_delay(&delayStatus);
> +			old_state = pg_atomic_read_u32(&lock->state);
>  		}
>  	}
> -	pg_unreachable();
> +#ifdef LWLOCK_STATS
> +	add_lwlock_stats_spin_stat(lock, &delayStatus);
> +#endif
> +
> +	/*
> +	 * We intentionally do not call finish_spin_delay here, because the loop
> +	 * above usually finished by queuing into the wait list on contention, and
> +	 * doesn't reach spins_per_delay thereby doesn't sleep inside of
> +	 * perform_spin_delay. Also, different LWLocks has very different
> +	 * contention pattern, and it is wrong to update spin-lock statistic based
> +	 * on LWLock contention.
> +	 */

Huh? This seems entirely unconvincing. Without adjusting this here we'll
just spin the same way every iteration - that's quite bad, especially
for the case where somebody else holds LW_FLAG_LOCKED.
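At the very least, feeding the result back whenever we actually spun would
keep the spins_per_delay heuristic alive. Untested sketch, to go after the
loop in place of the comment above; whether exiting via the wait queue
should count the same as acquiring the lock is debatable, but never calling
finish_spin_delay means the heuristic can't adapt at all:

	/* let s_lock.c adapt spins_per_delay, but only if we actually spun */
	if (delayStatus.spins > 0 || delayStatus.delays > 0)
		finish_spin_delay(&delayStatus);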
> From e5b13550fc48d62b0b855bedd7fcd5848b806b09 Mon Sep 17 00:00:00 2001
> From: Sokolov Yura <funny.fal...@postgrespro.ru>
> Date: Tue, 30 May 2017 18:54:25 +0300
> Subject: [PATCH 5/6] Fix implementation description in a lwlock.c .
>
> ---
>  src/backend/storage/lmgr/lwlock.c | 17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
>
> diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
> index 334c2a2d96..0a41c2c4e2 100644
> --- a/src/backend/storage/lmgr/lwlock.c
> +++ b/src/backend/storage/lmgr/lwlock.c
> @@ -62,16 +62,15 @@
>   * notice that we have to wait. Unfortunately by the time we have finished
>   * queuing, the former locker very well might have already finished it's
>   * work. That's problematic because we're now stuck waiting inside the OS.
> -
> - * To mitigate those races we use a two phased attempt at locking:
> - *	 Phase 1: Try to do it atomically, if we succeed, nice
> - *	 Phase 2: Add ourselves to the waitqueue of the lock
> - *	 Phase 3: Try to grab the lock again, if we succeed, remove ourselves from
> - *			  the queue
> - *	 Phase 4: Sleep till wake-up, goto Phase 1
>   *
> - * This protects us against the problem from above as nobody can release too
> - * quick, before we're queued, since after Phase 2 we're already queued.
> + * This race is avoided by taking a lock for the wait list using CAS with the old
> + * state value, so it would only succeed if lock is still held. Necessary flag
> + * is set using the same CAS.
> + *
> + * Inside LWLockConflictsWithVar the wait list lock is reused to protect the variable
> + * value. First the lock is acquired to check the variable value, then flags are
> + * set with a second CAS before queuing to the wait list in order to ensure that the lock was not
> + * released yet.
>   * -------------------------------------------------------------------------
>   */

I think this needs more extensive surgery.

> From cc74a849a64e331930a2285e15445d7f08b54169 Mon Sep 17 00:00:00 2001
> From: Sokolov Yura <funny.fal...@postgrespro.ru>
> Date: Fri, 2 Jun 2017 11:34:23 +0000
> Subject: [PATCH 6/6] Make SpinDelayStatus a bit lighter.
>
> It saves couple of instruction of fast path of spin-loop, and so makes
> fast path of LWLock a bit faster (and in other places where spinlock is
> used).
> Also it makes call to perform_spin_delay a bit slower, that could
> positively affect on spin behavior, especially if no `pause` instruction
> present.

Whaa? That seems pretty absurd reasoning.

One big advantage of this is that we should be able to make LW_SHARED
acquisition an xadd. I'd done that previously, when the separate spinlock
still existed, and it was quite beneficial for performance on larger
systems. But it was some fiddly code - it should be easier now. It
requires some careful management when noticing that an xadd acquired a
shared lock even though there's a conflicting exclusive lock, but it would
broaden the fastpath.
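To sketch the idea (untested and simplified; the function name and return
convention are made up, and LWLOCK_STATS and the slowpath are ignored):

/*
 * Try to grab a shared lock with a single atomic fetch-add instead of a
 * CAS loop.
 */
static bool
LWLockAttemptLockSharedXadd(LWLock *lock)
{
	uint32		old_state;

	/* unconditionally add one shared reference */
	old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);

	if ((old_state & LW_VAL_EXCLUSIVE) == 0)
		return true;			/* got the shared lock */

	/*
	 * Somebody holds the lock exclusively. We have to undo our increment
	 * before queueing, and the exclusive locker / waker has to tolerate
	 * such transient shared references in the state word.
	 */
	pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
	return false;
}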
Greetings,

Andres Freund

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers