> The original discussion started about the runq spin lock, but I
> think the scope of the problem is more general and the solution can be
> applied in both user and kernel space.  While in user space you would
> do sleep(0), in the kernel you would sched(), or if you are in the
> scheduler you would loop doing mwait (see my last email).

sched can't be called from sched!

> "The MWAIT instruction can be executed at any privilege level.  The
> MONITOR CPUID feature flag (ECX[bit 3] when CPUID is executed with EAX
> = 1) indicates the availability of the MONITOR and MWAIT instruction
> in a processor.  When set, the unconditional execution of MWAIT is
> supported at privilege level 0 and conditional execution is supported
> at privilege levels 1 through 3 (software should test for the
> appropriate support of these instructions before unconditional use)."
> 
> There are also other extensions, which I have not tried.  I think the
> ideas can be used in the kernel or in user space, though I have only
> tried it in the kernel and the implementation is only in the kernel
> right now.

thanks, i didn't see that.
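
for my own notes, here's roughly how i read that: check the cpuid bit
first, then re-arm the monitor around each mwait.  this is just a sketch
(gcc-style asm, made-up names, untested), not anything that's in the tree:

#include <stdint.h>
#include <cpuid.h>

enum { Monitorbit = 1<<3 };	/* cpuid leaf 1, ecx bit 3 */

static int
havemwait(void)
{
	unsigned a, b, c, d;

	if(__get_cpuid(1, &a, &b, &c, &d) == 0)
		return 0;
	return (c & Monitorbit) != 0;
}

static void
monitor(volatile void *addr)
{
	/* eax = linear address to arm, ecx = extensions, edx = hints */
	asm volatile("monitor" : : "a"(addr), "c"(0), "d"(0));
}

static void
mwaithint(void)
{
	/* eax = hints (target c-state), ecx = extensions */
	asm volatile("mwait" : : "a"(0), "c"(0));
}

/* wait for *w to change from old.  re-arm the monitor every pass,
   since mwait can return for reasons other than a store to the line. */
static void
waitfor(volatile uint32_t *w, uint32_t old)
{
	while(*w == old){
		monitor(w);
		if(*w != old)		/* could have changed before we armed */
			break;
		mwaithint();
	}
}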

> > i assume you mean that there is contention on the cacheline holding
> > the runq lock?  i don't think there's classical congestion, as i
> > believe cachelines not involved in the mwait would experience no
> > hold up.
> >
> 
> I mean congestion in the classical network sense.  There are switches
> and links to exchange messages for the coherency protocol, and some of
> them get congested.  What I was seeing was the counter of messages
> growing very, very fast and the performance degrading, which I
> interpret as something getting congested.  I think when possession of
> the lock is pingponged around (not necessarily contended, but many
> changes in who is holding the lock, or maybe contention) many messages
> are generated and then the problem occurs.  I certainly saw the HW
> counters for messages go up orders of magnitude when I was not using
> mwait.

any memory access makes the MESI protocol do work.  i'm still not
convinced that pounding one cache line can create enough memory traffic
to sink uninvolved processors.  (but i'm not not convinced either.)
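
fwiw, what i'd expect to matter is whether the waiters write while they
spin.  a sketch of the two shapes i have in mind (made-up names, gcc
builtins, not our code):

#include <stdint.h>

typedef struct {
	volatile uint32_t held;
} Spinlock;

/* naive xchg spin: every failed attempt is a write, so the line
   bounces between the waiters' caches and the coherency protocol
   carries messages even when the lock never changes hands. */
static void
lock_naive(Spinlock *l)
{
	while(__sync_lock_test_and_set(&l->held, 1))
		;
}

/* spin-on-read (test-and-test-and-set): waiters spin on a shared
   read-only copy of the line and only go for the atomic when the
   lock looks free, so the traffic happens at hand-off, not per spin. */
static void
lock_spinread(Spinlock *l)
{
	for(;;){
		while(l->held)
			__builtin_ia32_pause();		/* spin hint */
		if(__sync_lock_test_and_set(&l->held, 1) == 0)
			return;
	}
}

static void
unlock(Spinlock *l)
{
	__sync_lock_release(&l->held);
}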

> I think the latency of mwait (this is what I remember for the opterons
> I was measuring, probably different in intel and in other amd models)

opterons have traditionally had terrible memory latency.  especially
when crossing packages.

> is actually worse (bigger) than with spinning, but if you have enough
> processors doing the spinning (not necessarily on the same locks, but

are you sure that you're getting fair interleaving with the spin locks?  if
in fact you're interleaving on big scales (say the same processor gets the
lock 100 times in a row), that's cheating a bit, isn't it?

also, in user space, there is a two-order-of-magnitude difference between
sleep(0) and a wakeup, and that's one of the main reasons that the
semaphore-based lock measures quite poorly.  since the latency is not in
sleep and wakeup, it appears that it's in context switching.
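
the sort of sleep(0)-backed lock being suggested would look roughly like
this (a sketch; the spin bound is arbitrary and the names are made up):

#include <stdint.h>
#include <unistd.h>

typedef struct {
	volatile uint32_t held;
} Ulock;

static void
ulock(Ulock *l)
{
	int i;

	for(;;){
		for(i = 0; i < 1000; i++){	/* bounded read-mostly spin */
			if(l->held == 0 && __sync_lock_test_and_set(&l->held, 1) == 0)
				return;
			__builtin_ia32_pause();
		}
		/* on plan 9, sleep(0) gives up the processor; elsewhere
		   substitute whatever yield the system provides. */
		sleep(0);
	}
}

static void
uunlock(Ulock *l)
{
	__sync_lock_release(&l->held);
}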

> Let us know what your conclusions are after you implement and
> measure them :-).
> 

ticket locks (and some other stuff) are the difference between 200k iops and
1m iops.
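
for concreteness, the basic shape of a ticket lock (a sketch with made-up
names, not the code i measured): each waiter takes a ticket with a
fetch-and-add and spins reading until its number comes up, so hand-off is
fifo and nobody gets the lock 100 times in a row.

#include <stdint.h>

typedef struct {
	volatile uint32_t next;		/* next ticket to hand out */
	volatile uint32_t owner;	/* ticket currently being served */
} Ticketlock;

static void
tlock(Ticketlock *l)
{
	uint32_t t;

	t = __sync_fetch_and_add(&l->next, 1);	/* take a ticket */
	while(l->owner != t)
		__builtin_ia32_pause();		/* wait our turn */
}

static void
tunlock(Ticketlock *l)
{
	l->owner++;				/* only the holder writes this */
}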

- erik
