Hi everyone,

[Long, rambling, and potentially boring email follows. If you are not interested in different threading implementations then please stop reading now.]
I thought I'd write a simple spin lock implementation, using inline functions and an atomic test-and-set operation, to compare the efficiency of different locking approaches for implementing something like Apple's Grand Central Dispatch. I wrote a simple program which spawned n threads, each of which incremented a 64-bit volatile global counter m times, acquiring and releasing a lock on either side of each increment; waited for all of the threads to exit; and then tested that the counter was equal to n * m. I used a 64-bit global on i386 so that each increment would require two loads and two stores, giving extra time for things to go wrong in the absence of synchronisation. My spin lock was very simple; it just set an int value to 1 using an atomic test-and-set instruction and called sched_yield() (which gives scheduling time to another thread) if another thread held the lock. There's a rough sketch of the lock and the test loop after the numbers below.

By setting n to 1, I could simulate the condition where there was no lock contention. On all the platforms I tried (OS X, GNU/Linux, and FreeBSD), performance was similar (within an order of magnitude) using a native pthread mutex and using my spinlock.

The interesting things happened when I set n to 4, giving more contention. On GNU/Linux and FreeBSD, performance using a pthread mutex was very close to using my spinlock - presumably they use very similar implementations internally, but with a cross-library function call on top. On OS X, the story was somewhat different. For small values of m, the performance was a couple of orders of magnitude worse. For larger values of m, it was so bad that I gave up waiting. It seems that Darwin's libc does a couple of system calls for mutex operations, which gives horrendous performance. You can see this from the results of timing the program (n=4, m=100,000) on OS X:

real    0m15.937s
user    0m4.881s
sys     0m14.570s

Note the sys value; almost all of the time is spent in the kernel. Interestingly, with n set to 1, almost no time was spent in the kernel; apparently Apple optimise heavily for the uncontended case, although, since running the program with n=1 on a 2.16GHz Core 2 Duo under OS X was only 10% faster than running it with n=1 on a 1.15GHz Athlon running FreeBSD, I suspect that this is a serious case of misplaced optimisation. Note that this is not optimisation for the purely single-threaded case; spawning another thread that does nothing does not change the results by a measurable amount.

Out of interest, I tried the same test program using the libobjc mutex functions and with no locking at all (n=4, m=10,000,000) on FreeBSD. The results were about what I'd expect (I've only quoted the user times, because the system times are approximately 0 in all cases and the real time is close to the user time in all cases):

objc:       user 0m8.607s
pthread:    user 0m3.493s
spinlock:   user 0m1.372s (lock functions not inlined; inlining shaves around 30-40% off the times)
no locking: user 0m0.219s (generates the wrong result about 50% of the time)

The no-locking version, obviously, is much faster, but gives the wrong results. I also tried a version of the spinlock that didn't yield when it hit contention, and it performed about 50% worse than the one that did. Adding any kind of locking makes things a lot slower, with the performance differences between my spinlock, the pthread mutex (which does a little more error checking and has the call through position-independent code), and the libobjc mutex (which adds some more layers of indirection) being about what I anticipated.
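For anyone who wants to see roughly what I was measuring, the whole thing boils down to something like the sketch below. This is a simplified illustration rather than the exact code I ran - in particular, I've used GCC's __sync builtins for the atomic test-and-set here, whereas the real version used my own inline functions around the atomic operation - but the structure is the same: n threads, each taking the lock, bumping the 64-bit counter, and releasing the lock, m times.

#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

static volatile int lock_word = 0;
static volatile uint64_t counter = 0;

/* Spin until the test-and-set finds the lock word clear, yielding the
   rest of our timeslice whenever another thread holds the lock. */
static inline void spin_lock(volatile int *l)
{
    while (__sync_lock_test_and_set(l, 1))
        sched_yield();
}

static inline void spin_unlock(volatile int *l)
{
    __sync_lock_release(l);
}

#define N 4        /* number of threads                       */
#define M 10000000 /* increments per thread (the larger run)  */

static void *worker(void *unused)
{
    uint64_t i;
    (void)unused;
    for (i = 0 ; i < M ; i++)
    {
        spin_lock(&lock_word);
        counter++;              /* two loads and two stores on i386 */
        spin_unlock(&lock_word);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[N];
    int i;
    for (i = 0 ; i < N ; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0 ; i < N ; i++)
        pthread_join(threads[i], NULL);
    printf("counter = %llu, expected %llu\n",
           (unsigned long long)counter,
           (unsigned long long)N * M);
    return (counter == (uint64_t)N * M) ? 0 : 1;
}

The pthread variant is the same loop with spin_lock/spin_unlock replaced by pthread_mutex_lock/pthread_mutex_unlock on a global mutex, and the no-locking variant simply drops the lock calls (and, sure enough, usually prints the wrong counter).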
I was surprised that the libobjc mutex is (on FreeBSD, at least; the GNU/Linux machine I had access to didn't have libobjc installed) so much slower than using pthreads directly; I was expecting it to be closer to 5-6s. For comparison, I wrote a trivial class which inherited from Object and provided -lock and -unlock methods that called the corresponding pthread mutex functions (something along the lines of the sketch in the postscript below), and was surprised to discover that this was faster than using the libobjc functions; somehow they add more overhead than a message lookup.

pthread via message send: user 0m6.308s

(This is good news; I recently rewrote all of the GNUstep synchronisation classes to use pthreads directly instead of the libobjc abstraction layer, and Gregory is going to commit my diff after the next release of -base, so expect NSLock to speed up a lot at the next major [ABI-breaking] GNUstep-base release. Extrapolating from these numbers, I'd expect using the current NSLock to take almost twice as long as the new pthread-based one.)

One of the main reasons I did this was that I was interested in Apple's claim that adding a new object to a Grand Central Dispatch queue takes 15 instructions. This is somewhat disingenuous. Locking and unlocking the spinlock is 5 instructions, but they are atomic instructions which have to acquire the cache line exclusively, so they are not particularly fast instructions. In terms of real-world performance, acquiring and releasing a pthread mutex is fairly close (within a factor of two) on every platform I tried except OS X; maybe they'd shout about it a bit less if their pthread implementation[1] were not an embarrassment, since the performance gain relative to Apple's mutex implementation is a factor of 100 or so, while relative to anyone else's (including one I wrote in five minutes) it is closer to a factor of 2. In short, this is not an optimisation worth bothering with unless we discover that it's really a bottleneck.

David

[1] I also tried on OS X using their implementation of NSLock and got almost identical numbers to using pthreads directly; the overhead of the message send is very small.
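P.S. For completeness, the trivial Object subclass mentioned above was nothing more sophisticated than the sketch below. The class name and the details are illustrative rather than the exact code I used; the point is just that -lock and -unlock do nothing beyond one message send plus the pthread call.

#include <objc/Object.h>
#include <pthread.h>

@interface SimpleLock : Object
{
    pthread_mutex_t mutex;
}
- (id)init;
- (void)lock;
- (void)unlock;
@end

@implementation SimpleLock
- (id)init
{
    /* Nothing in the old Object root class needs setting up here,
       so just initialise the mutex. */
    pthread_mutex_init(&mutex, NULL);
    return self;
}
- (void)lock
{
    pthread_mutex_lock(&mutex);
}
- (void)unlock
{
    pthread_mutex_unlock(&mutex);
}
@end

The "pthread via message send" figure above is just this class wrapped around the same counter loop, so each iteration pays for two message sends on top of the two pthread calls.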