On Thu, Nov 05, 2015 at 04:35:22PM -0700, Ian Lepore wrote:
> On Thu, 2015-11-05 at 14:19 -0800, John Baldwin wrote:
> > On Thursday, November 05, 2015 01:45:19 PM Adrian Chadd wrote:
> > > On 5 November 2015 at 11:26, Mateusz Guzik <mjgu...@gmail.com>
> > > wrote:
> > > > On Thu, Nov 05, 2015 at 11:04:13AM -0800, John Baldwin wrote:
> > > > > On Thursday, November 05, 2015 04:26:28 PM Konstantin Belousov
> > > > > wrote:
> > > > > > On Thu, Nov 05, 2015 at 12:32:18AM +0100, Mateusz Guzik
> > > > > > wrote:
> > > > > > > mtx_lock will unconditionally try to grab the lock and if
> > > > > > > that fails,
> > > > > > > will call __mtx_lock_sleep which will immediately try to do
> > > > > > > the same
> > > > > > > atomic op again.
> > > > > > > 
> > > > > > > So, the obvious microoptimization is to check the state in
> > > > > > > __mtx_lock_sleep and avoid the operation if the lock is not
> > > > > > > free.
> > > > > > > 
> > > > > > > This gives me ~40% speedup in a microbenchmark of 40 find
> > > > > > > processes
> > > > > > > traversing tmpfs and contending on mount mtx (only used as
> > > > > > > an easy
> > > > > > > benchmark, I have WIP patches to get rid of it).
> > > > > > > 
> > > > > > > Second part of the patch is optional and just checks the
> > > > > > > state of the
> > > > > > > lock prior to doing any atomic operations, but it gives a
> > > > > > > very modest
> > > > > > > speed up when applied on top of the __mtx_lock_sleep
> > > > > > > change. As such,
> > > > > > > I'm not going to defend this part.
> > > > > > Shouldn't the same consideration applied to all spinning
> > > > > > loops, i.e.
> > > > > > also to the spin/thread mutexes, and to the spinning parts of
> > > > > > sx and
> > > > > > lockmgr ?
> > > > > 
> > > > > I agree.  I think both changes are good and worth doing in our
> > > > > other
> > > > > primitives.
> > > > > 
> > > > 
> > > > I glanced over e.g. rw_rlock and it did not have the issue, now
> > > > that I
> > > > see _sx_xlock_hard it wuld indeed use fixing.
> > > > 
> > > > Expect a patch in few h for all primitives I'll find. I'll stress
> > > > test
> > > > the kernel, but it is unlikely I'll do microbenchmarks for
> > > > remaining
> > > > primitives.
> > > 
> > > Is this stuff you're proposing still valid for non-x86 platforms?
> > 
> > Yes.  It just does a read before trying the atomic compare and swap
> > and
> > falls through to the hard case as if the atomic op failed if the
> > result
> > of the read would result in a compare failure.
> > 
> 
> The atomic ops include barriers, the new pre-read of the variable
> doesn't.  Will that cause problems, especially for code inside a loop
> where the compiler may decide to shuffle things around?
> 

Lock value is volatile, so the compiler should not screw us up. I removed
the store to the variable due to a different concern. In worst case we
would have slept instead of spinning. (see below)

> I suspect the performance gain will be biggest on the platforms where
> atomic ops are expensive (I gather from various code comments that's
> the case on x86).
> 

I don't know about other architectures, on x86 atomic op performance on
contended cachelines deteriorates drastically.

=======================================================

For the most port the patch below does 2 simple kinds of changes:
1. in macros for lock/unlock primitives the lock value is fetched and
  compared against relevant _UNLOCKED macro prior to the atomic op
2. in functions:

before:
while (!_mtx_obtain_lock(m, tid)) {
        v = m->mtx_lock;
        if (v != MTX_UNOWNED) {
                ....
        }
        .....
}

after:
for (;;) {
        if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
                break;
        v = m->mtx_lock;
        if (v != MTX_UNOWNED) {
                ....
        }
        .....
}

The original patch preloaded the 'v' variable. If the value was
MTX_UNOWNED and _mtx_obtain_lock failed, we just lost the race.  In
order to spin waiting for the other thread to release the lock, we have
to re-read the variable. Thus for simplicity it is no longer stored in
'v'. Note this is contrary to kib's earlier suggestion which would put
such threads directly to sleep. I don't have proper measurements to have
any strong opinion here, I went the aforementioned route to minimise
changes.

Note this is a trivial patch to improve the situation a little bit, I
have no interest in trying to polish these primitives at this time.

For overall results, on a machine with 32GB of RAM and the following:
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (2800.06-MHz K8-class
CPU)
FreeBSD/SMP: 2 package(s) x 10 core(s) x 2 SMT threads

this reduced make -j 40 kernel in a 10.2 jail from ~2:15 to ~2:05.

=======================================================

diff --git a/sys/kern/kern_lock.c b/sys/kern/kern_lock.c
index aa67180..198e93e 100644
--- a/sys/kern/kern_lock.c
+++ b/sys/kern/kern_lock.c
@@ -787,8 +787,10 @@ __lockmgr_args(struct lock *lk, u_int flags, struct 
lock_object *ilk,
                        break;
                }
 
-               while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED,
-                   tid)) {
+               for (;;) {
+                       if (lk->lk_lock == LK_UNLOCKED &&
+                           atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, 
tid))
+                               break;
 #ifdef HWPMC_HOOKS
                        PMC_SOFT_CALL( , , lock, failed);
 #endif
@@ -1124,7 +1126,11 @@ __lockmgr_args(struct lock *lk, u_int flags, struct 
lock_object *ilk,
                            __func__, iwmesg, file, line);
                }
 
-               while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid)) {
+               for (;;) {
+                       if (lk->lk_lock == LK_UNLOCKED &&
+                           atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, 
tid))
+                               break;
+
 #ifdef HWPMC_HOOKS
                        PMC_SOFT_CALL( , , lock, failed);
 #endif
diff --git a/sys/kern/kern_mutex.c b/sys/kern/kern_mutex.c
index bec8f6b..51e82a3 100644
--- a/sys/kern/kern_mutex.c
+++ b/sys/kern/kern_mutex.c
@@ -419,7 +419,9 @@ __mtx_lock_sleep(volatile uintptr_t *c, uintptr_t tid, int 
opts,
        all_time -= lockstat_nsecs(&m->lock_object);
 #endif
 
-       while (!_mtx_obtain_lock(m, tid)) {
+       for (;;) {
+               if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+                       break;
 #ifdef KDTRACE_HOOKS
                spin_cnt++;
 #endif
@@ -602,8 +604,9 @@ _mtx_lock_spin_cookie(volatile uintptr_t *c, uintptr_t tid, 
int opts,
 #ifdef KDTRACE_HOOKS
        spin_time -= lockstat_nsecs(&m->lock_object);
 #endif
-       while (!_mtx_obtain_lock(m, tid)) {
-
+       for (;;) {
+               if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+                       break;
                /* Give interrupts a chance while we spin. */
                spinlock_exit();
                while (m->mtx_lock != MTX_UNOWNED) {
@@ -675,7 +678,9 @@ retry:
                            m->lock_object.lo_name, file, line));
                WITNESS_CHECKORDER(&m->lock_object,
                    opts | LOP_NEWORDER | LOP_EXCLUSIVE, file, line, NULL);
-               while (!_mtx_obtain_lock(m, tid)) {
+               for (;;) {
+                       if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, 
tid))
+                               break;
                        if (m->mtx_lock == tid) {
                                m->mtx_recurse++;
                                break;
diff --git a/sys/kern/kern_rwlock.c b/sys/kern/kern_rwlock.c
index 6541724..6a904d2 100644
--- a/sys/kern/kern_rwlock.c
+++ b/sys/kern/kern_rwlock.c
@@ -771,7 +771,9 @@ __rw_wlock_hard(volatile uintptr_t *c, uintptr_t tid, const 
char *file,
        all_time -= lockstat_nsecs(&rw->lock_object);
        state = rw->rw_lock;
 #endif
-       while (!_rw_write_lock(rw, tid)) {
+       for (;;) {
+               if (rw->rw_lock == RW_UNLOCKED && _rw_write_lock(rw, tid))
+                       break;
 #ifdef KDTRACE_HOOKS
                spin_cnt++;
 #endif
diff --git a/sys/kern/kern_sx.c b/sys/kern/kern_sx.c
index 96e117b..2a81c04 100644
--- a/sys/kern/kern_sx.c
+++ b/sys/kern/kern_sx.c
@@ -544,7 +544,10 @@ _sx_xlock_hard(struct sx *sx, uintptr_t tid, int opts, 
const char *file,
        all_time -= lockstat_nsecs(&sx->lock_object);
        state = sx->sx_lock;
 #endif
-       while (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid)) {
+       for (;;) {
+               if (sx->sx_lock == SX_LOCK_UNLOCKED &&
+                   atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
+                       break;
 #ifdef KDTRACE_HOOKS
                spin_cnt++;
 #endif
diff --git a/sys/sys/mutex.h b/sys/sys/mutex.h
index a9ec072..0443922 100644
--- a/sys/sys/mutex.h
+++ b/sys/sys/mutex.h
@@ -185,7 +185,7 @@ void        thread_lock_flags_(struct thread *, int, const 
char *, int);
 #define __mtx_lock(mp, tid, opts, file, line) do {                     \
        uintptr_t _tid = (uintptr_t)(tid);                              \
                                                                        \
-       if (!_mtx_obtain_lock((mp), _tid))                              \
+       if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid)))\
                _mtx_lock_sleep((mp), _tid, (opts), (file), (line));    \
        else                                                            \
                LOCKSTAT_PROFILE_OBTAIN_LOCK_SUCCESS(adaptive__acquire, \
@@ -203,7 +203,7 @@ void        thread_lock_flags_(struct thread *, int, const 
char *, int);
        uintptr_t _tid = (uintptr_t)(tid);                              \
                                                                        \
        spinlock_enter();                                               \
-       if (!_mtx_obtain_lock((mp), _tid)) {                            \
+       if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid))) {\
                if ((mp)->mtx_lock == _tid)                             \
                        (mp)->mtx_recurse++;                            \
                else                                                    \
@@ -232,7 +232,7 @@ void        thread_lock_flags_(struct thread *, int, const 
char *, int);
                                                                        \
        if ((mp)->mtx_recurse == 0)                                     \
                LOCKSTAT_PROFILE_RELEASE_LOCK(adaptive__release, mp);   \
-       if (!_mtx_release_lock((mp), _tid))                             \
+       if ((mp)->mtx_lock != _tid || !_mtx_release_lock((mp), _tid))   \
                _mtx_unlock_sleep((mp), (opts), (file), (line));        \
 } while (0)
 
diff --git a/sys/sys/rwlock.h b/sys/sys/rwlock.h
index f8947c5..6b4f505 100644
--- a/sys/sys/rwlock.h
+++ b/sys/sys/rwlock.h
@@ -96,7 +96,7 @@
 #define        __rw_wlock(rw, tid, file, line) do {                            
\
        uintptr_t _tid = (uintptr_t)(tid);                              \
                                                                        \
-       if (!_rw_write_lock((rw), _tid))                                \
+       if ((rw)->rw_lock != RW_UNLOCKED || !_rw_write_lock((rw), _tid))\
                _rw_wlock_hard((rw), _tid, (file), (line));             \
        else                                                            \
                LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw, \
@@ -112,7 +112,7 @@
        else {                                                          \
                LOCKSTAT_PROFILE_RELEASE_RWLOCK(rw__release, rw,        \
                    LOCKSTAT_WRITER);                                   \
-               if (!_rw_write_unlock((rw), _tid))                      \
+               if ((rw)->rw_lock != _tid || !_rw_write_unlock((rw), _tid))\
                        _rw_wunlock_hard((rw), _tid, (file), (line));   \
        }                                                               \
 } while (0)
diff --git a/sys/sys/sx.h b/sys/sys/sx.h
index 96a664f..144cab5 100644
--- a/sys/sys/sx.h
+++ b/sys/sys/sx.h
@@ -150,7 +150,8 @@ __sx_xlock(struct sx *sx, struct thread *td, int opts, 
const char *file,
        uintptr_t tid = (uintptr_t)td;
        int error = 0;
 
-       if (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
+       if (sx->sx_lock != SX_LOCK_UNLOCKED ||
+           !atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
                error = _sx_xlock_hard(sx, tid, opts, file, line);
        else 
                LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx,
@@ -168,7 +169,8 @@ __sx_xunlock(struct sx *sx, struct thread *td, const char 
*file, int line)
        if (sx->sx_recurse == 0)
                LOCKSTAT_PROFILE_RELEASE_RWLOCK(sx__release, sx,
                    LOCKSTAT_WRITER);
-       if (!atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
+       if (sx->sx_lock != tid ||
+           !atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
                _sx_xunlock_hard(sx, tid, file, line);
 }
 

_______________________________________________
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

Reply via email to