Re: [dev] x86 osl/interlck.h performance

Ross Johnson Fri, 21 Apr 2006 19:01:28 -0700

On Fri, 2006-04-21 at 18:32 +0200, Jens-Heiner Rechtien wrote:
> Ross Johnson wrote:
> > On Fri, 2006-04-21 at 15:09 +0200, Stephan Bergmann wrote:
> >> Hi all,
> >>
> >> Someone recently mentioned that osl_increment/decrementInterlockedCount 
> >> would show up as top scorers with certain profiling tools (vtune?). 
> >> That got me thinking.  On both Linux x86 and Windows x86, those 
> >> functions are implemented in assembler, effectively consisting of a 
> >> LOCK-prefixed XADD.  Now, I thought that, at least on a uniprocessor 
> >> machine, the LOCK would probably not be that expensive, but that the 
> >> profiling tool in question might be confused by it and present bogus 
> >> results.
> >>
> >> However, the following little program on Linux x86 (where incLocked is a 
> >> copy of osl_incrementInterlockedCount, and incUnlocked is the same, 
> >> without the LOCK prefix) told a different story:
> > 
> > From a completely different project (pthreads-win32) I have seen the
> > same thing and was surprised.
> > 
> > I had read that the LOCK prefix has no effect for uni-processors.
> > However, on a single CPU system, the LOCK prefix slowed the interlocked
> > instructions down considerably. In this case, it was the xchg and
> > cmpxchg instructions - the same ones that are at the centre of the Win32
> > API InterlockedExchange routines.
> > 
> > I also found from timing tests using hand-optimised assembler that calls
> > to the Win32 API Interlocked routines appeared to be optimised when the
> > code is compiled by MSVC, but not GCC (say). It was as though MSVC was
> > emitting optimised assembler on the fly instead of calling the routines
> > in Kernel32.dll. My timings showed that the standard Interlocked routine
> > calls compiled with MSVC were as fast or faster than my inlined
> > assembler without the LOCK prefix. The interlocked routines are used as
> > the basis for the mutex operations in pthreads-win32, and using the
> > assembler versions, I was able to cut the time for some of the pthreads-
> > win32 test applications involving saturated POSIX reader-writer lock
> > calls to nearly 1/3 for the gcc compiled versions, and match the times
> > produced by the MSVC compiled code.
> 
> Now that's interesting! Did you disassemble what MSVC emits instead of 
> calling the interlocked routines? How do they achieve atomic operations 
> without the lock prefix to xadd, xchg or cmpxchg instructions?


No - but I did check the emitted assembler for both compilers to make
sure that the inlining and lock prefixing was as I expected.

With MSVC 6, the timings for calls to the Interlocked routines and for
the inlined non-locked asm were almost identical, whereas the inlined
locked asm was much slower. The same tests using GCC showed the
Interlocked calls to be about as slow as the locked asm from either GCC
or MSVC. The inlined non-locked asm timings for MSVC and GCC were very
similar; GCC may have been a little faster, reflecting that GCC can
optimise the inlined asm by substituting registers.
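
For illustration only, here is a minimal sketch of the kind of
locked/non-locked pair being timed, written as GCC inline asm around
XADD. It is not the actual osl or pthreads-win32 code; the names
inc_locked, inc_unlocked, time_calls and X86_GCC_ASM are made up for
this example, and the non-x86 fallbacks are only there so that it
compiles elsewhere.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical switch: use inline asm only with GCC-style compilers on x86. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define X86_GCC_ASM 1
#else
#define X86_GCC_ASM 0
#endif

/* Increment *p with a LOCK-prefixed XADD and return the new value. */
static inline int32_t inc_locked(volatile int32_t *p)
{
#if X86_GCC_ASM
    int32_t old = 1;
    /* XADD leaves the previous memory value in the register operand. */
    __asm__ __volatile__("lock; xaddl %0, %1"
                         : "+r"(old), "+m"(*p)
                         :
                         : "memory", "cc");
    return old + 1;
#else
    /* Assumption: fall back to the GCC/Clang atomic builtin elsewhere. */
    return __atomic_add_fetch(p, 1, __ATOMIC_SEQ_CST);
#endif
}

/* Same instruction without the LOCK prefix: not safe across processors. */
static inline int32_t inc_unlocked(volatile int32_t *p)
{
#if X86_GCC_ASM
    int32_t old = 1;
    __asm__ __volatile__("xaddl %0, %1"
                         : "+r"(old), "+m"(*p)
                         :
                         : "memory", "cc");
    return old + 1;
#else
    /* Plain, non-atomic read-modify-write as a stand-in. */
    int32_t v = *p + 1;
    *p = v;
    return v;
#endif
}

typedef int32_t (*inc_fn)(volatile int32_t *);

/* Time `iterations` calls of fn on a local counter; returns CPU seconds.
 * The call through a function pointer adds the same overhead to both
 * variants, so only the difference between them is meaningful. */
static double time_calls(inc_fn fn, long iterations)
{
    volatile int32_t count = 0;
    clock_t start = clock();
    for (long i = 0; i < iterations; ++i)
        fn(&count);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing time_calls(inc_locked, n) with time_calls(inc_unlocked, n) for
a large n should give a rough idea of the cost of the prefix on a given
machine; the numbers will of course depend on the CPU and compiler.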

AFAIK, the xchg etc. instructions are atomic without the lock prefix on
the single (non-hyperthreaded (TM)) processor system that I'm still
using.
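
As a rough sketch of what a non-locked compare-and-swap looks like (the
operation underneath InterlockedCompareExchange), assuming GCC inline
asm on x86: cas_unlocked is a hypothetical name, this is not the
pthreads-win32 source, and without the LOCK prefix it should only be
relied on where a single processor executes the code.

```c
#include <stdint.h>

/* Compare *dest with comparand; if equal, store exchange into *dest.
 * Returns the value *dest held before the operation.
 * No LOCK prefix: a single instruction cannot be split by an interrupt
 * on one CPU, but this gives no guarantee across multiple processors. */
static inline int32_t cas_unlocked(volatile int32_t *dest,
                                   int32_t exchange,
                                   int32_t comparand)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int32_t prev = comparand;
    /* CMPXCHG compares EAX with memory: on a match it stores the source
     * register to memory, otherwise it loads memory into EAX. */
    __asm__ __volatile__("cmpxchgl %2, %1"
                         : "+a"(prev), "+m"(*dest)
                         : "r"(exchange)
                         : "memory", "cc");
    return prev;
#else
    /* Plain, non-atomic stand-in so the sketch compiles on other targets. */
    int32_t prev = *dest;
    if (prev == comparand)
        *dest = exchange;
    return prev;
#endif
}
```

The locked form differs only by putting "lock; " in front of the
instruction.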

Ross


