Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

Sokolov Yura Thu, 19 Oct 2017 04:38:24 -0700

Hi,

On 2017-10-19 03:03, Andres Freund wrote:

Hi,


On 2017-09-08 22:35:39 +0300, Sokolov Yura wrote:

 /*
* Internal function that tries to atomically acquire the lwlock inthe passed
- * in mode.
+ * in mode. If it could not grab the lock, it doesn't puts proc intowait
+ * queue.
  *
- * This function will not block waiting for a lock to become free -that's the
- * callers job.
+ * It is used only in LWLockConditionalAcquire.
  *
- * Returns true if the lock isn't free and we need to wait.
+ * Returns true if the lock isn't free.
  */
 static bool
-LWLockAttemptLock(LWLock *lock, LWLockMode mode)
+LWLockAttemptLockOnce(LWLock *lock, LWLockMode mode)


This now has become a fairly special cased function, I'm not convinced
it makes much sense with the current naming and functionality.


It just had not to be as complex as LWLockAttpemptLockOrQueueSelf.
Their functionality is too different.

+/*

+ * Internal function that tries to atomically acquire the lwlock inthe passed

+ * in mode or put it self into waiting queue with waitmode.

+ * This function will not block waiting for a lock to become free -that's the

+ * callers job.
+ *
+ * Returns true if the lock isn't free and we are in a wait queue.
+ */
+static inline bool

+LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode,LWLockMode waitmode)

+{
+       uint32          old_state;
+       SpinDelayStatus delayStatus;
+       bool            lock_free;
+       uint32          mask,
+                               add;
+
+       AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
+
+       if (mode == LW_EXCLUSIVE)
+       {
+               mask = LW_LOCK_MASK;
+               add = LW_VAL_EXCLUSIVE;
+       }
+       else
+       {
+               mask = LW_VAL_EXCLUSIVE;
+               add = LW_VAL_SHARED;
+       }
+
+       init_local_spin_delay(&delayStatus);

The way you moved this around has the disadvantage that we now do this-

a number of writes - even in the very common case where the lwlock can
be acquired directly.


Excuse me, I don't understand fine.
Do you complain against init_local_spin_delay placed here?
Placing it in other place will complicate code.

Or you complain against setting `mask` and `add`?
At least it simplifies loop body. For example, it is simpler to write
specialized Power assembly using already filled `mask+add` registers
(i have a working draft with power assembly).

In both cases, I think simpler version should be accepted first. It acts
as algorithm definition. And it already gives measurable improvement.
After some testing, attempt to improve it may be performed (assuming
release will be in a next autumn, there is enough time for testing and
improvements)
I can be mistaken.

+       /*
+ * We intentionally do not call finish_spin_delay here, because theloop+ * above usually finished by queuing into the wait list oncontention, and
+        * doesn't reach spins_per_delay thereby doesn't sleep inside of
+        * perform_spin_delay. Also, different LWLocks has very different
+ * contention pattern, and it is wrong to update spin-lock statisticbased
+        * on LWLock contention.
+        */
Huh? This seems entirely unconvincing. Without adjusting this herewe'll
just spin the same way every iteration. Especially for the case where
somebody else holds LW_FLAG_LOCKED that's quite bad.


LWLock's are very different. Some of them are always short-term
(BufferLock), others are always locked for a long time. Some sparse, and
some crowded. It is quite bad to tune single variable `spins_per_delay`
by such different load.

And it is already tuned by spin-locks. And spin-locks tune it quitewell.


I've tried to place this delay into lock itself (it has 2 free bytes),
but this attempt performed worse.
Now I understand, that delays should be stored in array indexed by
tranche. But I have no time to test this idea. And I doubt it will give
cardinally better results (ie > 5%), so I think it is better to accept
patch in this way, and then experiment with per-tranche delay.

From e5b13550fc48d62b0b855bedd7fcd5848b806b09 Mon Sep 17 00:00:00 2001
From: Sokolov Yura <funny.fal...@postgrespro.ru>
Date: Tue, 30 May 2017 18:54:25 +0300
Subject: [PATCH 5/6] Fix implementation description in a lwlock.c .

---
 src/backend/storage/lmgr/lwlock.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/src/backend/storage/lmgr/lwlock.cb/src/backend/storage/lmgr/lwlock.c
index 334c2a2d96..0a41c2c4e2 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -62,16 +62,15 @@
* notice that we have to wait. Unfortunately by the time we havefinished* queuing, the former locker very well might have already finishedit's* work. That's problematic because we're now stuck waiting insidethe OS.
-
- * To mitigate those races we use a two phased attempt at locking:
- *      Phase 1: Try to do it atomically, if we succeed, nice
- *      Phase 2: Add ourselves to the waitqueue of the lock
- * Phase 3: Try to grab the lock again, if we succeed, removeourselves from
- *                       the queue
- *      Phase 4: Sleep till wake-up, goto Phase 1
  *
- * This protects us against the problem from above as nobody canrelease too- * quick, before we're queued, since after Phase 2 we're alreadyqueued.+ * This race is avoided by taking a lock for the wait list using CASwith the old+ * state value, so it would only succeed if lock is still held.Necessary flag
+ * is set using the same CAS.
+ *
+ * Inside LWLockConflictsWithVar the wait list lock is reused toprotect the variable+ * value. First the lock is acquired to check the variable value,then flags are+ * set with a second CAS before queuing to the wait list in order toensure that the lock was not
+ * released yet.
*-------------------------------------------------------------------------
  */
I think this needs more extensive surgery.


Again I don't understand what you mean. Looks like my English is not
very fluent :-(
Do you mean, there should be more thorough description?

Is description from letter is good enough to be copied into thiscomment?

From cc74a849a64e331930a2285e15445d7f08b54169 Mon Sep 17 00:00:00 2001
From: Sokolov Yura <funny.fal...@postgrespro.ru>
Date: Fri, 2 Jun 2017 11:34:23 +0000
Subject: [PATCH 6/6] Make SpinDelayStatus a bit lighter.

It saves couple of instruction of fast path of spin-loop, and so makes

fast path of LWLock a bit faster (and in other places where spinlockis

used).

Also it makes call to perform_spin_delay a bit slower, that could
positively affect on spin behavior, especially if no `pause`instruction
present.


Whaa? That seems pretty absurd reasoning.


:-) AMD Opteron doesn't respect "pause" instruction at all (as it is
written in s_lock.h , and several times mentioned in Internet).
Given whole reason of `perfrom_spin_delay` is to spent time somewhere,
making it a tiny-bit slower may only improve things :-)

On 2017-10-19 02:28, Andres Freund wrote:

On 2017-06-05 16:22:58 +0300, Sokolov Yura wrote:

Algorithm for LWLockWaitForVar is also refactored. New version is:
1. If lock is not held by anyone, it immediately exit.
2. Otherwise it is checked for ability to take WaitList lock, because
  variable is protected with it. If so, CAS is performed, and if it is
  successful, loop breaked to step 4.
3. Otherwise spin_delay perfomed, and loop returns to step 1.
4. var is checked for change.
  If it were changed, we unlock wait list lock and exit.
  Note: it could not change in following steps because we are holding
  wait list lock.

5. Otherwise CAS on setting necessary flags is performed. If itsucceed,

  then queue Proc to wait list and exit.
6. If CAS failed, then there is possibility for LWLock to be already
  released - if so then we should unlock wait list and exit.
7. Otherwise loop returns to step 5.

So invariant is preserved:
- either we found LWLock free,
- or we found changed variable,
- or we set flags and queue self while LWLock were held.

Spin_delay is not performed at step 7, because we should release wait
list lock as soon as possible.


That seems unconvincing - by not delaying you're more likely to
*increase* the time till the current locker that holds the lock can
release the lock.


But why? If our CAS wasn't satisfied, then other's one were satisfied.
So there is always overall progress. And if it will sleep in this
loop, then other waiters will spin in first loop in this functions.
But given concrete usage of LWLockWaitForVar, probably it is not too
bad to hold other waiters in first loop in this function (they loops
until we release WaitListLock or lock released at whole).

I'm in doubts what is better. May I keep it unchanged now?

BTW, I found a small mistake in this place: I forgot to set
LW_FLAG_LOCKED in a state before this CAS. Looks like it wasn't real
error, because CAS always failed at first loop iteration (because real
`lock->state` had LW_FLAG_LOCKED already set), and after failed CAS
state adsorbs value from `lock->state`.
I'll fix it.

One big advantage of this is that we should be able to make LW_SHARED
acquisition an xadd. I'd done that previously, when the separate
spinlock still existed, and it was quite beneficial for performance on
larger systems. But it was some fiddly code - should be easier now.
Requires some careful management when noticing that an xadd acquired a
shared lock even though there's a conflicting exclusive lock, but
increase the fastpath.


--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

Reply via email to