On 10 Aug 2009, at 23:42, Quentin Mathé wrote:

> Hi David,
>
> On 10 Aug 09, at 01:36, David Chisnall wrote:
>
>> One of the main reasons I did this was that I was interested in
>> Apple's claim that adding a new object to a Grand Central Dispatch
>> queue takes 15 instructions. This is somewhat disingenuous. Locking
>> and unlocking the spinlock is 5 instructions, but these instructions
>> acquire the cache line holding the lock and so are not particularly
>> fast. In terms of real-world performance, acquiring and releasing a
>> pthread mutex is fairly close (factor of two) on any platform I tried
>
> Interesting to know.

Yup. I was wondering if they were using some clever lockless structure, but if we just use an inline version of the little mutex I wrote using the GCC atomic ops (on OS X I had to use some inline asm, because Apple's GCC 4.0 doesn't support the atomic ops), then we should get the same performance.
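Something along these lines, as a minimal sketch rather than the exact code (the type name, function names, and spin threshold are illustrative, and it assumes the GCC __sync builtins rather than the inline asm fallback):

    #include <sched.h>

    typedef volatile int tiny_mutex_t;   /* 0 = unlocked, 1 = locked */

    static inline void tiny_mutex_lock(tiny_mutex_t *m)
    {
        int spins = 0;
        /* __sync_lock_test_and_set returns the old value, so 0 means we
         * took the lock.  Spin briefly, then get out of the way of
         * whoever currently holds it. */
        while (__sync_lock_test_and_set(m, 1))
        {
            if (++spins > 100)
            {
                sched_yield();
            }
        }
    }

    static inline void tiny_mutex_unlock(tiny_mutex_t *m)
    {
        /* Atomic store of 0 with release semantics. */
        __sync_lock_release(m);
    }

The uncontended path is one atomic exchange to lock and a release store to unlock; the sched_yield() only comes into play under contention.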
Of course, a full implementation of GCD needs feedback from the kernel so that each process can determine the optimum number of threads to spawn. This could possibly be done using the existing code we have for counting the number of CPUs, plus a small amount of extra code to track load.

>> except OS X; maybe they'd shout about it a bit less if their pthread
>> implementation[1] were not an embarrassment, as the performance gain
>> relative to Apple's mutex implementation is a factor of 100 or so,
>> while relative to anyone else's (including one I wrote in five
>> minutes) is closer to a factor of 2.
>
> The result could be completely different in Mac OS X 10.6, though.
> In fact, I'd be surprised if they didn't rewrite their mutex
> implementation given the result you report.

Looking a bit more carefully, it appears that the problem is that they use a Mach semaphore rather than sched_yield() to make the mutex sleep. This is an optimisation for cases where you expect threads to block for a long time, but that is generally not how you use a pthread mutex (that's what a condition variable is for).

A better implementation would spin a few times without sleeping, spin a few more times calling sched_yield(), and then wait on a semaphore, atomically setting a flag in the mutex to tell the corresponding unlock call to signal the semaphore. They may be doing this and just have really badly tuned heuristics, in which case it's quite easy to fix. If they haven't already fixed it, though, they probably won't before 10.6 is released; changes to a mutex implementation are not something you should rush.

A perfect implementation of a mutex would record the average waiting time for each {thread, mutex} pair and use it to choose the waiting policy. In most cases, however, sched_yield() is sufficient, because good code doesn't hold mutexes for very long (Amdahl's law), so just kicking the other threads to run for a bit is usually enough to let the one that holds the mutex give it up. And, of course, you should never be using a mutex in a situation where you need latency guarantees.

In short, it looks like someone just copied the SysV semaphore code into the pthreads implementation without noting that SysV semaphores and pthread mutexes are used in very different ways.

David

_______________________________________________
Etoile-dev mailing list
Etoile-dev@gna.org
https://mail.gna.org/listinfo/etoile-dev