Ross Johnson wrote:
On Fri, 2006-04-21 at 18:32 +0200, Jens-Heiner Rechtien wrote:
Ross Johnson wrote:
On Fri, 2006-04-21 at 15:09 +0200, Stephan Bergmann wrote:
Hi all,
Someone recently mentioned that osl_increment/decrementInterlockedCount
would show up as top scorers with certain profiling tools (vtune?).
That got me thinking. On both Linux x86 and Windows x86, those
functions are implemented in assembler, effectively consisting of a
LOCK-prefixed XADD. Now, I thought that, at least on a uniprocessor
machine, the LOCK would probably not be that expensive, but that the
profiling tool in question might be confused by it and present bogus
results.
However, the following little program on Linux x86 (where incLocked is a
copy of osl_incrementInterlockedCount, and incUnlocked is the same,
without the LOCK prefix) told a different story:
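(Stephan's actual test program isn't quoted above; the following is a
minimal sketch of that kind of comparison. The xadd-based increments
follow the description above, but the timing harness, the loop count and
the use of clock() are assumptions for illustration only.)

    #include <stdio.h>
    #include <time.h>

    typedef int oslInterlockedCount;

    /* LOCK-prefixed xadd, as described for osl_incrementInterlockedCount
       on Linux x86 */
    static oslInterlockedCount incLocked(oslInterlockedCount *pCount)
    {
        oslInterlockedCount nCount = 1;
        __asm__ __volatile__ ("lock; xaddl %0, %1"
                              : "+r" (nCount), "+m" (*pCount)
                              :
                              : "memory");
        return nCount + 1;   /* old value + 1 = incremented value */
    }

    /* the same increment, but without the LOCK prefix */
    static oslInterlockedCount incUnlocked(oslInterlockedCount *pCount)
    {
        oslInterlockedCount nCount = 1;
        __asm__ __volatile__ ("xaddl %0, %1"
                              : "+r" (nCount), "+m" (*pCount)
                              :
                              : "memory");
        return nCount + 1;
    }

    int main(void)
    {
        enum { LOOPS = 100000000 };
        oslInterlockedCount count = 0;
        clock_t t0, t1, t2;
        int i;

        t0 = clock();
        for (i = 0; i < LOOPS; ++i)
            incLocked(&count);
        t1 = clock();
        for (i = 0; i < LOOPS; ++i)
            incUnlocked(&count);
        t2 = clock();

        printf("locked:   %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("unlocked: %.2fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }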
From a completely different project (pthreads-win32) I have seen the
same thing and was surprised.
I had read that the LOCK prefix has no effect for uni-processors.
However, on a single CPU system, the LOCK prefix slowed the interlocked
instructions down considerably. In this case, it was the xchg and
cmpxchg instructions - the same ones that are at the centre of the Win32
API InterlockedExchange routines.
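(For illustration, a cmpxchg-based compare-and-swap of the kind being
timed might look like the sketch below; the function names are made up
here, not the pthreads-win32 identifiers, and GCC inline-asm syntax is
assumed. Each returns the previous value of *target, as
InterlockedCompareExchange does.)

    /* compare-and-swap with the LOCK prefix */
    static long cas_locked(volatile long *target, long newval, long oldval)
    {
        long prev;
        __asm__ __volatile__ ("lock; cmpxchgl %2, %1"
                              : "=a" (prev), "+m" (*target)
                              : "r" (newval), "0" (oldval)
                              : "memory");
        return prev;   /* equals oldval on success */
    }

    /* the same operation without the LOCK prefix */
    static long cas_unlocked(volatile long *target, long newval, long oldval)
    {
        long prev;
        __asm__ __volatile__ ("cmpxchgl %2, %1"
                              : "=a" (prev), "+m" (*target)
                              : "r" (newval), "0" (oldval)
                              : "memory");
        return prev;
    }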
I also found from timing tests using hand-optimised assembler that calls
to the Win32 API Interlocked routines appeared to be optimised when the
code was compiled by MSVC, but not by GCC, say. It was as though MSVC
was emitting optimised assembler on the fly instead of calling the
routines in Kernel32.dll. My timings showed that the standard
Interlocked routine calls compiled with MSVC were as fast as or faster
than my inlined assembler without the LOCK prefix. The Interlocked
routines are used as the basis for the mutex operations in
pthreads-win32, and by using the assembler versions I was able to cut
the time for some of the pthreads-win32 test applications involving
saturated POSIX reader-writer lock calls to nearly 1/3 for the
GCC-compiled versions, and match the times produced by the MSVC-compiled
code.
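(A sketch of what forcing the intrinsic form can look like; the
_InterlockedIncrement spelling and the <intrin.h> header are from newer
MSVC versions and are assumptions here, used only to illustrate the
inlining behaviour described above.)

    #include <intrin.h>

    #pragma intrinsic(_InterlockedIncrement)

    long bump(long volatile *p)
    {
        /* MSVC expands this inline as a lock-prefixed instruction
           instead of emitting a call into Kernel32.dll */
        return _InterlockedIncrement(p);
    }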
Now that's interesting! Did you disassemble what MSVC emits instead of
calling the Interlocked routines? How do they achieve atomic operations
without the lock prefix on the xadd, xchg or cmpxchg instructions?
No - but I did check the emitted assembler for both compilers to make
sure that the inlining and lock prefixing were as I expected.
The timings for the Interlocked routine calls and for the inlined
non-locked asm using MSVC 6 were almost identical, whereas the inlined
locked asm was much slower. The same tests using GCC showed the
Interlocked calls to be similar to the slower locked asm from either GCC
or MSVC. The inlined non-locked asm for MSVC and GCC were very similar;
GCC may have been a little faster, reflecting that GCC can optimise the
inlined asm by substituting registers.
Your measurements seem to suggest that Microsoft uses a conditional
approach in the non-inlined versions of the Interlocked[In|De]crement
routines, omitting the lock prefix for older processors/single-processor
systems. The additional check would penalize newer
HT/multicore/multiprocessor systems; whether that matters at all would
need to be measured.
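(Purely as a guess at what such a conditional approach could look like;
this is speculation, not Microsoft's code. GCC inline-asm syntax is used
to keep the sketch consistent with the ones above, and GetSystemInfo is
the documented way to read the processor count.)

    #include <windows.h>

    static int g_multiProc = -1;    /* -1 = not yet determined */

    LONG myInterlockedIncrement(volatile LONG *p)
    {
        LONG n = 1;

        if (g_multiProc < 0) {
            SYSTEM_INFO si;
            GetSystemInfo(&si);
            g_multiProc = (si.dwNumberOfProcessors > 1);
        }

        if (g_multiProc)
            /* multiprocessor: the LOCK prefix is required */
            __asm__ __volatile__ ("lock; xaddl %0, %1"
                                  : "+r" (n), "+m" (*p) : : "memory");
        else
            /* single CPU: the read-modify-write cannot be interleaved
               with another processor, so the unprefixed form suffices */
            __asm__ __volatile__ ("xaddl %0, %1"
                                  : "+r" (n), "+m" (*p) : : "memory");

        return n + 1;   /* the incremented value, like InterlockedIncrement */
    }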
AFAIK, the xchg etc. instructions are atomic even without the lock
prefix on the single (non-hyperthreaded (TM)) processor system that I'm
still using.
The Intel manuals state that xchg implicitly behaves as if it had a
lock prefix.
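(In other words, the two forms should behave identically; a small sketch
in GCC inline-asm syntax, with an illustrative function name:)

    /* exchange *target with newval and return the previous value; no
       "lock;" is written, but xchg with a memory operand locks anyway */
    static long exchange(volatile long *target, long newval)
    {
        __asm__ __volatile__ ("xchgl %0, %1"
                              : "+r" (newval), "+m" (*target)
                              :
                              : "memory");
        return newval;   /* old value of *target */
    }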
BTW, on newer processors (P4, Xeon etc.) the "lock" prefix shouldn't be
that expensive, because if the target memory of the instruction is
cacheable the CPU will not assert the LOCK# signal (which locks the bus)
but will only lock the affected cache line.
Heiner
--
Jens-Heiner Rechtien
[EMAIL PROTECTED]