Ross Johnson wrote:
On Fri, 2006-04-21 at 18:32 +0200, Jens-Heiner Rechtien wrote:
Ross Johnson wrote:
On Fri, 2006-04-21 at 15:09 +0200, Stephan Bergmann wrote:
Hi all,
Someone recently mentioned that osl_increment/decrementInterlockedCount
would show up as top scorers with certain profiling tools (vtune?).
That got me thinking. On both Linux x86 and Windows x86, those
functions are implemented in assembler, effectively consisting of a
LOCK-prefixed XADD. Now, I thought that, at least on a uniprocessor
machine, the LOCK would probably not be that expensive, but that the
profiling tool in question might be confused by it and present bogus
results.
However, the following little program on Linux x86 (where incLocked is a
copy of osl_incrementInterlockedCount, and incUnlocked is the same,
without the LOCK prefix) told a different story:
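(Stephan's actual test program isn't quoted above; the following is a
minimal sketch of that kind of comparison. The xadd-based increments
follow the description above, but the timing harness, the loop count and
the use of clock() are assumptions for illustration only.)

    #include <stdio.h>
    #include <time.h>

    typedef int oslInterlockedCount;

    /* LOCK-prefixed xadd, as described for osl_incrementInterlockedCount
       on Linux x86 */
    static oslInterlockedCount incLocked(oslInterlockedCount *pCount)
    {
        oslInterlockedCount nCount = 1;
        __asm__ __volatile__ ("lock; xaddl %0, %1"
                              : "+r" (nCount), "+m" (*pCount)
                              :
                              : "memory");
        return nCount + 1;   /* old value + 1 = incremented value */
    }

    /* the same increment, but without the LOCK prefix */
    static oslInterlockedCount incUnlocked(oslInterlockedCount *pCount)
    {
        oslInterlockedCount nCount = 1;
        __asm__ __volatile__ ("xaddl %0, %1"
                              : "+r" (nCount), "+m" (*pCount)
                              :
                              : "memory");
        return nCount + 1;
    }

    int main(void)
    {
        enum { LOOPS = 100000000 };
        oslInterlockedCount count = 0;
        clock_t t0, t1, t2;
        int i;

        t0 = clock();
        for (i = 0; i < LOOPS; ++i)
            incLocked(&count);
        t1 = clock();
        for (i = 0; i < LOOPS; ++i)
            incUnlocked(&count);
        t2 = clock();

        printf("locked:   %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("unlocked: %.2fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }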
From a completely different project (pthreads-win32) I have seen the
same thing and was surprised.
I had read that the LOCK prefix has no effect for uni-processors.
However, on a single CPU system, the LOCK prefix slowed the interlocked
instructions down considerably. In this case, it was the xchg and
cmpxchg instructions - the same ones that are at the centre of the Win32
API InterlockedExchange routines.
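(For illustration, a cmpxchg-based compare-and-swap of the kind being
timed might look like the sketch below; the function names are made up
here, not the pthreads-win32 identifiers, and GCC inline-asm syntax is
assumed. Each returns the previous value of *target, as
InterlockedCompareExchange does.)

    /* compare-and-swap with the LOCK prefix */
    static long cas_locked(volatile long *target, long newval, long oldval)
    {
        long prev;
        __asm__ __volatile__ ("lock; cmpxchgl %2, %1"
                              : "=a" (prev), "+m" (*target)
                              : "r" (newval), "0" (oldval)
                              : "memory");
        return prev;   /* equals oldval on success */
    }

    /* the same operation without the LOCK prefix */
    static long cas_unlocked(volatile long *target, long newval, long oldval)
    {
        long prev;
        __asm__ __volatile__ ("cmpxchgl %2, %1"
                              : "=a" (prev), "+m" (*target)
                              : "r" (newval), "0" (oldval)
                              : "memory");
        return prev;
    }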
I also found from timing tests using hand-optimised assembler that calls
to the Win32 API Interlocked routines appeared to be optimised when the
code was compiled by MSVC, but not by GCC, say. It was as though MSVC
was emitting optimised assembler on the fly instead of calling the
routines in Kernel32.dll. My timings showed that the standard
Interlocked routine calls compiled with MSVC were as fast as or faster
than my inlined assembler without the LOCK prefix. The Interlocked
routines are used as the basis for the mutex operations in
pthreads-win32, and by using the assembler versions I was able to cut
the time for some of the pthreads-win32 test applications involving
saturated POSIX reader-writer lock calls to nearly 1/3 for the
GCC-compiled versions, and match the times produced by the MSVC-compiled
code.
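(A sketch of what forcing the intrinsic form can look like; the
_InterlockedIncrement spelling and the <intrin.h> header are from newer
MSVC versions and are assumptions here, used only to illustrate the
inlining behaviour described above.)

    #include <intrin.h>

    #pragma intrinsic(_InterlockedIncrement)

    long bump(long volatile *p)
    {
        /* MSVC expands this inline as a lock-prefixed instruction
           instead of emitting a call into Kernel32.dll */
        return _InterlockedIncrement(p);
    }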
Now that's interesting! Did you disassemble what MSVC emits instead of
calling the Interlocked routines? How do they achieve atomic operations
without the lock prefix on the xadd, xchg or cmpxchg instructions?
No - but I did check the emitted assembler for both compilers to make
sure that the inlining and lock prefixing were as I expected.
The timings for the Interlocked routine calls and for the inlined
non-locked asm using MSVC 6 were almost identical, whereas the inlined
locked asm was much slower. The same tests using GCC showed the
Interlocked calls to be similar to the slower locked asm from either GCC
or MSVC. The inlined non-locked asm for MSVC and GCC were very similar;
GCC may have been a little faster, reflecting that GCC can optimise the
inlined asm by substituting registers.
Your measurements seem to suggest that Microsoft uses a conditional
approach in the non-inlined versions of the Interlocked[In|De]crement
routines, omitting the lock prefix for older processors/single-processor
systems. The additional check would penalize newer
HT/multicore/multiprocessor systems; whether that matters at all would
need to be measured.
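(Purely as a guess at what such a conditional approach could look like;
this is speculation, not Microsoft's code. GCC inline-asm syntax is used
to keep the sketch consistent with the ones above, and GetSystemInfo is
the documented way to read the processor count.)

    #include <windows.h>

    static int g_multiProc = -1;    /* -1 = not yet determined */

    LONG myInterlockedIncrement(volatile LONG *p)
    {
        LONG n = 1;

        if (g_multiProc < 0) {
            SYSTEM_INFO si;
            GetSystemInfo(&si);
            g_multiProc = (si.dwNumberOfProcessors > 1);
        }

        if (g_multiProc)
            /* multiprocessor: the LOCK prefix is required */
            __asm__ __volatile__ ("lock; xaddl %0, %1"
                                  : "+r" (n), "+m" (*p) : : "memory");
        else
            /* single CPU: the read-modify-write cannot be interleaved
               with another processor, so the unprefixed form suffices */
            __asm__ __volatile__ ("xaddl %0, %1"
                                  : "+r" (n), "+m" (*p) : : "memory");

        return n + 1;   /* the incremented value, like InterlockedIncrement */
    }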
AFAIK, the xchg etc. instructions are atomic even without the lock
prefix on the single (non-hyperthreaded (TM)) processor system that I'm
still using.
The Intel manuals state that xchg implicitly behaves as if it had a
lock prefix.
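(In other words, the two forms should behave identically; a small sketch
in GCC inline-asm syntax, with an illustrative function name:)

    /* exchange *target with newval and return the previous value; no
       "lock;" is written, but xchg with a memory operand locks anyway */
    static long exchange(volatile long *target, long newval)
    {
        __asm__ __volatile__ ("xchgl %0, %1"
                              : "+r" (newval), "+m" (*target)
                              :
                              : "memory");
        return newval;   /* old value of *target */
    }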
BTW, on newer processors (P4, Xeon etc.) the "lock" prefix shouldn't be
that expensive, because if the target memory of the instruction is
cacheable the CPU will not assert the LOCK# signal (which locks the bus)
but will only lock the affected cache line.
Heiner
--
Jens-Heiner Rechtien
[EMAIL PROTECTED]