On 10 Aug 2009, at 23:42, Quentin Mathé wrote:

> Hi David,
>
> On 10 Aug 09, at 01:36, David Chisnall wrote:
>
>> One of the main reasons I did this was that I was interested in
>> Apple's claim that adding a new object to a Grand Central Dispatch
>> queue takes 15 instructions.  This is somewhat disingenuous.  Locking
>> and unlocking the spinlock is 5 instructions, but these instructions
>> acquire the lock's cache line and so are not particularly fast.
>> In terms of real-world performance, acquiring and releasing a pthread
>> mutex is fairly close (within a factor of two) on any platform I tried
>
> Interesting to know.

Yup.  I was wondering if they were using some clever lockless
structure, but if we just use an inline version of the little mutex I
wrote using the GCC atomic ops (on OS X I had to use some inline asm,
because Apple's GCC 4.0 doesn't support the atomic builtins), then we
should get the same performance.  Of course, a full implementation of
GCD needs feedback from the kernel to let each process determine the
optimum number of threads to spawn.  This could possibly be done using
the existing code we have for counting the number of CPUs, plus a
small amount of extra code to track the load.
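
For reference, here is a minimal sketch of the kind of inline spinlock
I mean, using GCC's __sync builtins.  The names are illustrative only,
not the actual Etoile code, and on Apple's GCC 4.0 the builtins would
have to be replaced with inline asm:

/* Minimal spinlock sketch using GCC's __sync atomic builtins (GCC >= 4.1).
 * Names are illustrative only, not the actual Etoile code. */
#include <sched.h>

typedef volatile int espin_lock_t;   /* 0 = unlocked, 1 = locked */

static inline void espin_lock(espin_lock_t *lock)
{
    /* __sync_lock_test_and_set atomically stores 1 and returns the old
     * value, with acquire semantics; loop until we see the old value 0. */
    while (__sync_lock_test_and_set(lock, 1))
    {
        /* Contended: let other threads (hopefully the holder) run
         * instead of hammering the cache line. */
        sched_yield();
    }
}

static inline void espin_unlock(espin_lock_t *lock)
{
    /* Stores 0 with release semantics. */
    __sync_lock_release(lock);
}

The uncontended path here is just the atomic exchange plus the release
store, which is where the 'few instructions' figure comes from.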

>> except OS X; maybe they'd shout about it a bit less if their pthread
>> implementation[1] were not an embarrassment: the performance gain
>> relative to Apple's mutex implementation is a factor of 100 or so,
>> while the gain relative to anyone else's (including one I wrote in
>> five minutes) is closer to a factor of 2.
>
> The result could be completely different in Mac OS X 10.6 though.
> In fact, I'd be surprised if they didn't rewrite their mutex
> implementation given the result you report.

Looking a bit more carefully, it appears that the problem is that they
use a Mach semaphore rather than sched_yield() to make the mutex
sleep.  This is an optimisation for cases where you expect threads to
block for a long time, but that is generally not how you use a pthread
mutex (that's what a condition variable is for).  A better
implementation would spin a few times without sleeping, spin a few
more times calling sched_yield(), and only then wait on a semaphore,
atomically setting a flag in the mutex so that the corresponding
unlock call knows to signal the semaphore.  They may be doing this and
just have really badly-tuned heuristics, in which case it's quite easy
to fix.  If they haven't already fixed it, though, they probably won't
before 10.6 is released; changes to a mutex implementation are not
something you should rush.
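
In rough C, the strategy I'm describing looks something like the
following.  This is only a sketch: the spin and yield counts are
made-up numbers, and I'm using a POSIX semaphore where Apple would use
a Mach semaphore:

/* Rough sketch of the spin / yield / sleep mutex described above.
 * The constants and the POSIX semaphore are placeholders. */
#include <sched.h>
#include <semaphore.h>

enum { SPIN_COUNT = 40, YIELD_COUNT = 10 };

typedef struct
{
    volatile int locked;    /* 0 = free, 1 = held */
    volatile int waiters;   /* threads sleeping (or about to sleep) on sem */
    sem_t        sem;       /* initialised with sem_init(&sem, 0, 0) */
} adaptive_mutex_t;

static void adaptive_lock(adaptive_mutex_t *m)
{
    int i;

    /* Phase 1: spin briefly; a well-written critical section is short. */
    for (i = 0; i < SPIN_COUNT; i++)
        if (__sync_bool_compare_and_swap(&m->locked, 0, 1))
            return;

    /* Phase 2: yield so the current holder gets a chance to run. */
    for (i = 0; i < YIELD_COUNT; i++)
    {
        if (__sync_bool_compare_and_swap(&m->locked, 0, 1))
            return;
        sched_yield();
    }

    /* Phase 3: advertise that we are going to sleep, then block until an
     * unlock signals the semaphore.  Because semaphores count, a post
     * that happens before we reach sem_wait() is not lost. */
    for (;;)
    {
        __sync_fetch_and_add(&m->waiters, 1);
        if (__sync_bool_compare_and_swap(&m->locked, 0, 1))
        {
            __sync_fetch_and_sub(&m->waiters, 1);
            return;
        }
        sem_wait(&m->sem);
        __sync_fetch_and_sub(&m->waiters, 1);
    }
}

static void adaptive_unlock(adaptive_mutex_t *m)
{
    __sync_lock_release(&m->locked);
    /* Only pay for the semaphore signal when someone is actually waiting. */
    if (m->waiters > 0)
        sem_post(&m->sem);
}

The key point is that the uncontended path never touches the semaphore
at all, and the unlock path only signals it when the waiters count
says that someone actually went to sleep.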

A perfect implementation of a mutex would record the average waiting
time for each {thread, mutex} pair and use this to choose the waiting
policy.  In most cases, however, sched_yield() is sufficient: good
code doesn't hold mutexes for very long (Amdahl's law), so just
kicking the other threads to run for a bit is usually enough to let
the one that holds the lock give it up.  And, of course, you should
never be using a mutex in a situation where you need latency
guarantees.

In short, it looks like someone just copied the SysV semaphore code  
into the pthreads implementation without noting that SysV semaphores  
and pthread mutexes are used in very different ways.

David
_______________________________________________
Etoile-dev mailing list
Etoile-dev@gna.org
https://mail.gna.org/listinfo/etoile-dev
