On 15/12/17(Fri) 22:03, Mateusz Guzik wrote:
> [...] 
> However, contended behaviour is a regression compared to the asm
> variant.

Now that I've checked the files in, could you generate a diff with your
suggestions?

> From what I gather this is a step towards unifying all mutex
> implementations, hence the membar_* use.

Exactly.

> > +void
> > +__mtx_enter(struct mutex *mtx)
> > +{
> > +#ifdef MP_LOCKDEBUG
> > +     int nticks = __mp_lock_spinout;
> > +#endif
> > +
> > +     while (__mtx_enter_try(mtx) == 0) {
> > +             CPU_BUSY_CYCLE();
> 
> So this is effectively __mtx_enter_try with a single pause in between.
> 
> > +}
> > +
> > +int
> > +__mtx_enter_try(struct mutex *mtx)
> > +{
> > +     struct cpu_info *owner, *ci = curcpu();
> > +     int s;
> > +
> > +     if (mtx->mtx_wantipl != IPL_NONE)
> > +             s = splraise(mtx->mtx_wantipl);
> > +
> 
> This is at least one read.
> 
> > +     owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> > +#ifdef DIAGNOSTIC
> > +     if (__predict_false(owner == ci))
> > +             panic("mtx %p: locking against myself", mtx);
> > +#endif
> > +     if (owner == NULL) {
> > +             membar_enter_after_atomic();
> > +             if (mtx->mtx_wantipl != IPL_NONE)
> > +                     mtx->mtx_oldipl = s;
> 
> This repeats the read done earlier.
> 
> Since the caller loops, this is effectively a very inefficient lock of
> the form:
> while (!atomic_cas_ptr(...))
>         CPU_BUSY_CYCLE();
> 
> + some reads in-between
> 
> Assembly code would spin waiting for the lock to become free before
> playing with the spl level and attempting to lock. I don't know how
> contended your mutexes are right now, but this will have to be changed
> to at least match the current asm behaviour fairly soon as more of the
> kernel gets unlocked. Going for ticket locks or backoff later will
> probably work well enough for the foreseeable future and postpone the
> need to invest in anything fancy.
> 
> That said, proposed restructure is as follows (pseudo-code, no debug):
> 
> void
> __mtx_enter(struct mutex *mtx)
> {
>         struct cpu_info *owner, *ci = curcpu();
>         int want_ipl, s;
> 
>         /* mark mtx_wantipl as volatile or add a read casted through
>          * one to force a read *here* and no re-reads later */
>         want_ipl = mtx->mtx_wantipl;
> 
>         for (;;) {
>                 if (want_ipl != IPL_NONE)
>                         s = splraise(want_ipl);
>                 owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
>                 if (owner == NULL) {
>                         membar_enter_after_atomic();
>                         if (want_ipl != IPL_NONE)
>                                 mtx->mtx_oldipl = s;
>                         break;
>                 }
> 
>                 if (want_ipl != IPL_NONE)
>                         splx(s);
> 
>                 do {
>                         CPU_BUSY_CYCLE();
>                 } while (mtx->mtx_owner != NULL);
>         }
> }
> 
> I don't think partial duplication of try_enter can be avoided without
> serious ugliness.

Duplication is not a problem since we're going to have a single MI file
for all (most?) archs.
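
To make sure I read you right, here is your restructure with the debug
bits from my diff folded back in.  Sketch only; the MP_LOCKDEBUG body is
my guess at what the spinout handling should do, modelled on the
existing __mp_lock code:

void
__mtx_enter(struct mutex *mtx)
{
	struct cpu_info *owner, *ci = curcpu();
	int want_ipl, s;
#ifdef MP_LOCKDEBUG
	int nticks = __mp_lock_spinout;
#endif

	/* Read once up front so the value is not re-read on every spin. */
	want_ipl = mtx->mtx_wantipl;

	for (;;) {
		if (want_ipl != IPL_NONE)
			s = splraise(want_ipl);
		owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
#ifdef DIAGNOSTIC
		if (__predict_false(owner == ci))
			panic("mtx %p: locking against myself", mtx);
#endif
		if (owner == NULL) {
			membar_enter_after_atomic();
			if (want_ipl != IPL_NONE)
				mtx->mtx_oldipl = s;
			return;
		}
		if (want_ipl != IPL_NONE)
			splx(s);

		/* Spin with the spl dropped until the lock looks free. */
		do {
			CPU_BUSY_CYCLE();
#ifdef MP_LOCKDEBUG
			/* assumed spinout handling, adjust to the diff */
			if (--nticks <= 0) {
				db_printf("%s: %p lock spun out\n",
				    __func__, mtx);
				db_enter();
				nticks = __mp_lock_spinout;
			}
#endif
		} while (mtx->mtx_owner != NULL);
	}
}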

> > +void
> > +__mtx_leave(struct mutex *mtx)
> > +{
> > +     int s;
> > +
> > +     MUTEX_ASSERT_LOCKED(mtx);
> > +
> > +#ifdef DIAGNOSTIC
> > +     curcpu()->ci_mutex_level--;
> > +#endif
> > +
> > +     s = mtx->mtx_oldipl;
> > +#ifdef MULTIPROCESSOR
> > +     membar_exit_before_atomic();
> > +#endif
> > +     mtx->mtx_owner = NULL;
> > +     if (mtx->mtx_wantipl != IPL_NONE)
> > +             splx(s);
> > +}
> 
> It should be somewhat faster to read mtx_wantipl while the lock is
> still held - it is less likely that someone else has dirtied the
> cacheline. mtx_oldipl can then be read only when it is needed.
> 
> This function does not do atomic ops, so membar_exit_before_atomic does
> not really fit, even though it happens to do the exact same thing the
> "correctly" named primitive would - just provide a compiler barrier on
> amd64/i386.
> 
> From what I gather you are trying to mimic the illumos nomenclature,
> but they don't have an equivalent afaics. (Perhaps Solaris grew one in
> the meantime?)
> 
> In FreeBSD an appropriate routine is named atomic_thread_fence_rel (see
> amd64/include/atomic.h) and I suggest just borrowing the api.

API changes are subject to bikeshedding, so I'd suggest we concentrate
on the nice optimizations/improvements you're pointing out.
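
For example, applying the mtx_wantipl suggestion to __mtx_leave would
give something like the sketch below, keeping the existing membar name
for now pending the bikeshed:

void
__mtx_leave(struct mutex *mtx)
{
	int s, want_ipl;

	MUTEX_ASSERT_LOCKED(mtx);

#ifdef DIAGNOSTIC
	curcpu()->ci_mutex_level--;
#endif

	/*
	 * Read the ipl fields while we still hold the lock: the
	 * cacheline is less likely to have been dirtied by another
	 * CPU, and mtx_oldipl is only read when it is actually needed.
	 */
	want_ipl = mtx->mtx_wantipl;
	if (want_ipl != IPL_NONE)
		s = mtx->mtx_oldipl;

#ifdef MULTIPROCESSOR
	membar_exit_before_atomic();
#endif
	mtx->mtx_owner = NULL;
	if (want_ipl != IPL_NONE)
		splx(s);
}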

> Side note is that you could probably shorten the ipl vars to make the
> lock smaller. It may even be doable to fit everything into the lock
> word, but I don't know how much sense playing with that would make.

One reason to have MI locks is also to make debugging easier.  I don't
know if shrinking the size of the structure goes in the same direction.
Time will tell.
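
If we ever go there, I'd expect the shrunk layout to look roughly like
this (hypothetical, not something I'm proposing):

struct mutex {
	volatile void *mtx_owner;	/* lock word */
	uint8_t mtx_wantipl;		/* IPL_* values fit in a byte */
	uint8_t mtx_oldipl;
};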
