On 15/12/17(Fri) 22:03, Mateusz Guzik wrote: > [...] > However, contended behaviour is a regression compared to the asm > variant.
Now that I checked the files in could you generate a diff with your suggestions? > From what I gather this is a step towards unifying all mutex > implementations, hence the membar_* use. Exactly. > > +void > > +__mtx_enter(struct mutex *mtx) > > +{ > > +#ifdef MP_LOCKDEBUG > > + int nticks = __mp_lock_spinout; > > +#endif > > + > > + while (__mtx_enter_try(mtx) == 0) { > > + CPU_BUSY_CYCLE(); > > So this is effectively __mtx_enter_try with single pause in-between. > > > +} > > + > > +int > > +__mtx_enter_try(struct mutex *mtx) > > +{ > > + struct cpu_info *owner, *ci = curcpu(); > > + int s; > > + > > + if (mtx->mtx_wantipl != IPL_NONE) > > + s = splraise(mtx->mtx_wantipl); > > + > > This is at least one read. > > > + owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci); > > +#ifdef DIAGNOSTIC > > + if (__predict_false(owner == ci)) > > + panic("mtx %p: locking against myself", mtx); > > +#endif > > + if (owner == NULL) { > > + membar_enter_after_atomic(); > > + if (mtx->mtx_wantipl != IPL_NONE) > > + mtx->mtx_oldipl = s; > > This repeats the read done earlier. > > Since the caller loops, this is effectively a very inefficient lock of > the form: > while (!atomic_cas_ptr(...)) > CPU_BUSY_CYCLE(); > > + some reads in-between > > Assembly code would spin waiting for the lock to become free before > playing with spl level and attempting to lock. I don't know how > contended your mutexes are right now, but this will have to be very > quickly changed at least to current asm behaviour as more of the kernel > gets unlocked. Going for ticket locks or backoff later will probably > work fine enough for the foreseeable future and will postpone the need > to invest into anything fancy. > > That said, proposed restructure is as follows (pseudo-code, no debug): > > void > __mtx_enter(struct mutex *mtx) > { > int want_ipl, s; > > /* mark mtx_wantipl as volatile or add a read casted through > * one to force a read *here* and no re-reads later */ > want_ipl = mtx->mtx_wantipl; > > for (;;) { > if (want_ipl != IPL_NONE) > s = splraise(want_ipl); > owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci); > if (owner == NULL) { > membar_enter_after_atomic(); > if (want_ipl != IPL_NONE) > mtx->mtx_oldipl = s; > break; > } > > if (want_ipl != IPL_NONE) > splx(s); > > do { > CPU_BUSY_CYCLE(); > } while (mtx->mtx_owner != NULL); > } > } > > I don't think partial duplication of try_enter can be avoided without > serious ugliness. Duplication is not a problem since we're going to have a single MI file for all (most?) archs. > > +void > > +__mtx_leave(struct mutex *mtx) > > +{ > > + int s; > > + > > + MUTEX_ASSERT_LOCKED(mtx); > > + > > +#ifdef DIAGNOSTIC > > + curcpu()->ci_mutex_level--; > > +#endif > > + > > + s = mtx->mtx_oldipl; > > +#ifdef MULTIPROCESSOR > > + membar_exit_before_atomic(); > > +#endif > > + mtx->mtx_owner = NULL; > > + if (mtx->mtx_wantipl != IPL_NONE) > > + splx(s); > > +} > > It should be somewhat faster to read mtx_wantipl while the lock is still > held - it is less likely that someone else dirtied the cacheline. Then > mtx_oldipl can be only conditionally read. > > This function does not do atomic ops, so membar_exit_before_atomic does > not really fit, even though it happens to do the exact same thing the > "correctly" named primitive would - just provide a compiler barrier on > amd64/i386. > > From what I gather you are trying to mimick illumos nomenclature, but > they don't have an equivalent afaics. (perhaps Solaris grew one in the > meantime?) > > In FreeBSD an appropriate routine is named atomic_thread_fence_rel (see > amd64/include/atomic.h) and I suggest just borrowing the api. API changes are subject to bikescheding so I'd suggest we concentrate on the nice optimizations/improvements you're pointing out. > Side note is that you probably can shorten ipl to vars to make the lock > smaller. It can be doable to fit it into lock word, but I don't know how > much > sense would playing with it make. One reason to have MI locks is also to make debugging easier. I don't know if shrinking the size of the structure goes in the same direction. Time will tell.