Hi everyone,

[Long, rambling, and potentially boring email follows.  If you are not
interested in different threading implementations then please stop
reading now.]

I thought I'd write a simple spin lock implementation, using inline
functions and an atomic test-and-set operation, to compare efficiency
when implementing something like Apple's Grand Central.  I wrote a
simple program which spawned n threads, each of which incremented a
64-bit volatile global counter m times, acquiring and releasing a lock
around each increment.  The main thread waited for all of the threads
to exit and then tested that the counter was equal to n * m.  I used a
64-bit global on i386 so that each increment would require two loads
and two stores, giving extra time for things to go wrong in the
absence of synchronisation.
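
In outline, the harness was something like this (a sketch with
illustrative names rather than the exact code; the lock and unlock
calls in the loop are the part that gets swapped for each
implementation):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative values; the real program varied n and m. The counter
 * is a volatile 64-bit global so that, on i386, each increment needs
 * two loads and two stores. */
#define N_THREADS 4
#define ITERATIONS 100000

static volatile uint64_t counter;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *unused)
{
    for (uint64_t i = 0 ; i < ITERATIONS ; i++)
    {
        /* These two calls are what gets replaced by the spinlock,
         * the libobjc mutex, or nothing at all. */
        pthread_mutex_lock(&mutex);
        counter++;
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[N_THREADS];
    for (int i = 0 ; i < N_THREADS ; i++)
    {
        pthread_create(&threads[i], NULL, worker, NULL);
    }
    for (int i = 0 ; i < N_THREADS ; i++)
    {
        pthread_join(threads[i], NULL);
    }
    /* The counter should be exactly n * m if the locking worked. */
    if (counter != (uint64_t)N_THREADS * ITERATIONS)
    {
        printf("Wrong result: %llu\n", (unsigned long long)counter);
        return 1;
    }
    return 0;
}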

My spin lock was very simple: it set an int flag to 1 using an atomic
test-and-set instruction, and called sched_yield() (which gives
scheduling time to another thread) if another thread already held the
lock.
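
The lock itself was roughly this; I've written the test-and-set here
with the GCC __sync builtins, which may or may not be exactly what my
inline functions used, but the behaviour is the same:

#include <sched.h>

/* 0 = unlocked, 1 = locked. */
typedef volatile int spinlock_t;

/* Atomically set the flag to 1 and look at the old value.  If it was
 * already 1, someone else holds the lock, so give away the rest of
 * our timeslice instead of hammering the bus. */
static inline void spin_lock(spinlock_t *lock)
{
    while (__sync_lock_test_and_set(lock, 1))
    {
        sched_yield();
    }
}

/* Write 0 back with release semantics. */
static inline void spin_unlock(spinlock_t *lock)
{
    __sync_lock_release(lock);
}

Dropping the sched_yield() call gives the non-yielding variant that I
mention further down.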

By setting n to 1, I could simulate the case where there was no lock
contention.  On all the platforms I tried (OS X, GNU/Linux, and
FreeBSD), performance with a native pthread mutex was similar (within
an order of magnitude) to performance with my spinlock.

The interesting things happened when I set n to 4, giving more  
contention.  On GNU/Linux and FreeBSD, performance using a pthread
mutex was very close to that of my spinlock - presumably they use very
similar implementations internally, with a cross-library function call
on top.  On OS X, the story was somewhat different.  For small values
of m, the performance was a couple of orders of magnitude worse.  For
larger values of m, it was so bad that I gave up waiting.  It seems
that Darwin's libc does a couple of system calls for each mutex
operation, which gives horrendous performance.  You can see this from
the results of timing the program (n = 4, m = 100,000) on OS X:

real    0m15.937s
user    0m4.881s
sys     0m14.570s

Note the sys value; almost all of the time is spent in the kernel.
Interestingly, with n set to 1, almost no time was spent in the
kernel; apparently Apple optimise heavily for the uncontended case.
However, since running the program with n=1 on a 2.16GHz Core 2 Duo
under OS X was only 10% faster than running it with n=1 on a 1.15GHz
Athlon running FreeBSD, I suspect that this is a serious case of
misplaced optimisation.  Note that this is not optimisation for the
purely single-threaded case; spawning another thread that does nothing
does not change the results by a measurable amount.

Out of interest, I tried the same test program using the libobjc mutex
functions and with no locking (n = 4, m = 10,000,000) on FreeBSD.  The
results were about what I'd expect (I've only quoted the user times,  
because the system times are approximately 0 in all cases and the real  
time is close to the user time in all cases):

objc:
user    0m8.607s

pthread:
user    0m3.493s

spinlock (lock functions not inlined; inlining shaves around 30-40%
off these times):
user    0m1.372s

no locking (generates the wrong result about 50% of the time):
user    0m0.219s

The no-locking version, obviously, is much faster, but gives the wrong
results.  I also tried a version of the spinlock that didn't yield
under contention, and this performed about 50% worse than the one that
did.  Adding any kind of locking makes the program a lot slower, and
the performance differences between my spinlock, the pthread mutex
(which does a little more error checking and is called through
position-independent code) and the libobjc mutex (which adds some more
layers of indirection) are about what I anticipated.
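
For the libobjc version, the harness just swaps the pthread calls in
the inner loop for the runtime's mutex functions from objc/thr.h; from
memory, the relevant part looks roughly like this:

#include <objc/thr.h>
#include <stdint.h>

static objc_mutex_t objc_lock;
static volatile uint64_t counter;

/* Done once, before the worker threads are spawned. */
static void setup_lock(void)
{
    objc_lock = objc_mutex_allocate();
}

/* The body of the worker loop in the libobjc variant. */
static void locked_increment(void)
{
    objc_mutex_lock(objc_lock);
    counter++;
    objc_mutex_unlock(objc_lock);
}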

I was surprised that the libobjc mutex was so much slower than using
pthreads directly (on FreeBSD, at least; the GNU/Linux machine I had
access to didn't have libobjc installed).  I was expecting it to be
closer to 5-6s.  For comparison, I wrote a trivial class which
inherited from Object and provided -lock and -unlock methods that
called the corresponding pthread mutex functions, and was surprised to
discover that this was faster than using the libobjc functions;
somehow they add more overhead than a message lookup.
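
The class was nothing more than this (reconstructed from memory, and
assuming the classic Object root class from the GNU runtime, so treat
it as a sketch):

#import <objc/Object.h>
#include <pthread.h>

@interface PThreadLock : Object
{
    pthread_mutex_t mutex;
}
- (id) init;
- (void) lock;
- (void) unlock;
@end

@implementation PThreadLock
- (id) init
{
    pthread_mutex_init(&mutex, NULL);
    return self;
}
- (void) lock
{
    pthread_mutex_lock(&mutex);
}
- (void) unlock
{
    pthread_mutex_unlock(&mutex);
}
@end

The worker loop just sends -lock and -unlock to one shared instance,
so the extra cost over raw pthreads is two message sends per
iteration.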

pthread via message send:
user    0m6.308s

(This is good news; I recently rewrote all of the GNUstep
synchronisation classes to use pthreads directly instead of the
libobjc abstraction layer, and Gregory is going to commit my diff
after the next release of -base, so expect NSLock to speed up a lot at
the next major [ABI-breaking] GNUstep-base release.  Extrapolating
from these numbers, I'd expect the current NSLock to take almost twice
as long.)

One of the main reasons I did this was that I was interested in
Apple's claim that adding a new object to a Grand Central Dispatch
queue takes 15 instructions.  This is somewhat disingenuous.  Locking
and unlocking the spinlock is 5 instructions, but these instructions
acquire the lock's cache line and so are not particularly fast
instructions.  In terms of real-world performance, acquiring and
releasing a pthread mutex is fairly close (within a factor of two) on
any platform I tried except OS X; maybe they'd shout about it a bit
less if their pthread implementation[1] were not an embarrassment, as
the performance gain relative to Apple's mutex implementation is a
factor of 100 or so, while relative to anyone else's (including the
one I wrote in five minutes) it is closer to a factor of 2.  In short,
this is not an optimisation worth bothering with unless we discover
that it's really a bottleneck.

David

[1] I also tried the test on OS X using their implementation of NSLock
and got almost identical numbers to using pthreads directly; the
overhead of the message send is very small.
