Re: [dev] x86 osl/interlck.h performance

Ross Johnson Fri, 21 Apr 2006 07:41:03 -0700

On Fri, 2006-04-21 at 15:09 +0200, Stephan Bergmann wrote:
> Hi all,
> 
> Someone recently mentioned that osl_increment/decrementInterlockedCount 
> would show up as top scorers with certain profiling tools (vtune?). 
> That got me thinking.  On both Linux x86 and Windows x86, those 
> functions are implemented in assembler, effectively consisting of a 
> LOCK-prefixed XADD.  Now, I thought that, at least on a uniprocessor 
> machine, the LOCK would probably not be that expensive, but that the 
> profiling tool in question might be confused by it and present bogus 
> results.
> 
> However, the following little program on Linux x86 (where incLocked is a 
> copy of osl_incrementInterlockedCount, and incUnlocked is the same, 
> without the LOCK prefix) told a different story:


From a completely different project (pthreads-win32) I have seen the
same thing and was surprised.

I had read that the LOCK prefix has no effect on uniprocessors.
However, on a single-CPU system, the LOCK prefix slowed the interlocked
instructions down considerably. In this case, it was the xchg and
cmpxchg instructions, the same ones that are at the centre of the Win32
API InterlockedExchange routines.

Timing tests against hand-optimised assembler also suggested that calls
to the Win32 API Interlocked routines are optimised when the code is
compiled by MSVC, but not when it is compiled by GCC (say). It was as
though MSVC was emitting optimised assembler inline instead of calling
the routines in Kernel32.dll. My timings showed that the standard
Interlocked routine calls compiled with MSVC were as fast as or faster
than my inlined assembler without the LOCK prefix. The interlocked
routines are the basis for the mutex operations in pthreads-win32.
Using the assembler versions, I was able to cut the run time of some of
the pthreads-win32 test applications involving saturated POSIX
reader-writer lock calls to nearly 1/3 of the original for the
GCC-compiled versions, matching the times produced by the MSVC-compiled
code.
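
For illustration only, here is a minimal sketch of what inlined
exchange and compare-exchange helpers could look like when built with
GCC. It is not the pthreads-win32 code: it assumes a compiler that
provides the __atomic builtins (GCC 4.7 or later, or Clang), and the
sketch_* names are hypothetical. Unlike the assembler variant without
the LOCK prefix described above, these builtins stay atomic on
multiprocessor machines.

```c
#include <assert.h>

/* Hypothetical stand-in for InterlockedExchange: atomically store
   value into *target and return the value that was there before. */
static inline long sketch_exchange(long *target, long value)
{
    return __atomic_exchange_n(target, value, __ATOMIC_SEQ_CST);
}

/* Hypothetical stand-in for InterlockedCompareExchange: store exchange
   into *target only if *target equals comparand, and return the value
   observed in *target before the operation. */
static inline long sketch_compare_exchange(long *target, long exchange,
                                           long comparand)
{
    long expected = comparand;
    /* On failure the builtin copies the current value into expected;
       on success expected still holds the old (matching) value. */
    __atomic_compare_exchange_n(target, &expected, exchange, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}

/* Single-threaded check of the return-value conventions. */
int sketch_demo(void)
{
    long v = 5;
    long a = sketch_exchange(&v, 7);             /* v becomes 7 */
    long b = sketch_compare_exchange(&v, 9, 7);  /* match: v becomes 9 */
    long c = sketch_compare_exchange(&v, 1, 7);  /* no match: v stays 9 */
    assert(a == 5);
    assert(b == 7);
    assert(c == 9);
    return (int)v;
}
```

Because the helpers are static inline, the compiler can expand them at
the call site, which avoids the cost of a call into a DLL.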

And I agree with the figure mentioned below, that the LOCK prefix slows
the x* instructions down by up to 8 times, or maybe even a bit more.
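
A rough way to repeat the comparison without inline assembler is
sketched below. It assumes a C11 compiler with <stdatomic.h>; the
function names are made up for this example. On x86 a typical compiler
turns the atomic increment into a LOCK-prefixed instruction, while the
plain version is an ordinary load, add and store rather than an
unlocked xadd, so the ratio will not match the quoted figures exactly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

/* Increment a counter with an atomic read-modify-write. */
long count_atomic(long iterations)
{
    atomic_long counter = 0;
    for (long i = 0; i < iterations; ++i)
        atomic_fetch_add(&counter, 1);
    return atomic_load(&counter);
}

/* Increment a counter without atomicity; volatile keeps the compiler
   from collapsing the loop into a single addition. */
long count_plain(long iterations)
{
    volatile long counter = 0;
    for (long i = 0; i < iterations; ++i)
        counter = counter + 1;
    return counter;
}

/* Return the CPU seconds consumed by one call of fn(iterations). */
double time_it(long (*fn)(long), long iterations)
{
    clock_t start = clock();
    fn(iterations);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Print both timings after checking that both loops counted fully. */
void report(long iterations)
{
    long a = count_atomic(iterations);
    long p = count_plain(iterations);
    assert(a == iterations);
    assert(p == iterations);
    printf("atomic: %.3fs  plain: %.3fs\n",
           time_it(count_atomic, iterations),
           time_it(count_plain, iterations));
}
```

Calling report(100000000L) uses the same iteration count as the quoted
test program; the numbers it prints depend on the CPU and compiler.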

Ross Johnson

>    // lock.c
>    #include <stdio.h>
>    int incLocked(int * p) {
>      int n;
>      __asm__ __volatile__ (
>        "movl $1, %0\n\t"
>        "lock\n\t"
>        "xaddl %0, %2\n\t"
>        "incl %0" :
>        "=&r" (n), "=m" (*p) :
>        "m" (*p) :
>        "memory");
>      return n;
>    }
>    int incUnlocked(int * p) {
>      int n;
>      __asm__ __volatile__ (
>        "movl $1, %0\n\t"
>        "xaddl %0, %2\n\t"
>        "incl %0" :
>        "=&r" (n), "=m" (*p) :
>        "m" (*p) :
>        "memory");
>      return n;
>    }
>    int main(int argc, char ** argv) {
>      int i;
>      int n = 0;
>      if (argc > 1 && argv[1][0] == 'l') {
>        puts("locked version");
>        for (i = 0; i < 100000000; ++i) {
>          incLocked(&n);
>        }
>      } else {
>        puts("unlocked version");
>        for (i = 0; i < 100000000; ++i) {
>          incUnlocked(&n);
>        }
>      }
>      return 0;
>    }
> 
> m1> cat /proc/cpuinfo
>    processor : 0
>    model name: Intel(R) Pentium(R) 4 CPU 1.80GHz
>    ...
> m1> time ./lock l
>    locked version
>    11.868u 0.000s 0:12.19 97.2%  0+0k 0+0io 0pf+0w
> m1> time ./lock u
>    unlocked version
>    1.516u 0.000s 0:01.57 96.1%  0+0k 0+0io 0pf+0w
> 
> m2> cat /proc/cpuinfo
>    processor : 0
>    model name: AMD Opteron(tm) Processor 242
>    processor : 1
>    model name: AMD Opteron(tm) Processor 242
>    ...
> m2> time ./lock l
>    locked version
>    1.863u 0.000s 0:01.86 100.0%  0+0k 0+0io 0pf+0w
> m2> time ./lock u
>    unlocked version
>    0.886u 0.000s 0:00.89 98.8%  0+0k 0+0io 0pf+0w
> 
> So, depending on CPU type, the version with LOCK is 2--8 times slower 
> than the version without LOCK.  Would be interesting to see whether this 
> has any actual impact on overall OOo performance.  (But first, I'm off 
> on vacation...)
> 
> -Stephan
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
