Ross Johnson wrote:
On Fri, 2006-04-21 at 15:09 +0200, Stephan Bergmann wrote:
Hi all,

Someone recently mentioned that osl_increment/decrementInterlockedCount would show up as top scorers with certain profiling tools (vtune?). That got me thinking. On both Linux x86 and Windows x86, those functions are implemented in assembler, effectively consisting of a LOCK-prefixed XADD. Now, I thought that, at least on a uniprocessor machine, the LOCK would probably not be that expensive, but that the profiling tool in question might be confused by it and present bogus results.

However, the following little program on Linux x86 (where incLocked is a copy of osl_incrementInterlockedCount, and incUnlocked is the same, without the LOCK prefix) told a different story:

From a completely different project (pthreads-win32) I have seen the
same thing and was surprised.

I had read that the LOCK prefix has no effect for uni-processors.
However, on a single CPU system, the LOCK prefix slowed the interlocked
instructions down considerably. In this case, it was the xchg and
cmpxchg instructions - the same ones that are at the centre of the Win32
API InterlockedExchange routines.

I also found from timing tests using hand-optimised assembler that calls
to the Win32 API Interlocked routines appeared to be optimised when the
code is compiled by MSVC, but not GCC (say). It was as though MSVC was
emitting optimised assembler on the fly instead of calling the routines
in Kernel32.dll. My timings showed that the standard Interlocked routine
calls compiled with MSVC were as fast or faster than my inlined
assembler without the LOCK prefix. The interlocked routines are used as
the basis for the mutex operations in pthreads-win32, and using the
assembler versions, I was able to cut the time for some of the pthreads-
win32 test applications involving saturated POSIX reader-writer lock
calls to nearly 1/3 for the gcc compiled versions, and match the times
produced by the MSVC compiled code.

Now that's interesting! Did you disassemble what MSVC emits instead of calling the interlocked routines? How do they achieve atomic operations without the LOCK prefix on the xadd, xchg or cmpxchg instructions?


And I agree with the figure mentioned below, that the LOCK prefix slows
the x* instructions down by up to 8 times, or maybe even a bit more.
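
A rough way to repeat this kind of comparison is sketched below. It is not
the program from the original post. The names inc_locked, inc_unlocked,
run_locked, run_unlocked, cpu_seconds and print_timings are made up for
illustration, and it uses C11 atomics instead of hand-written assembler, on
the assumption that an x86 compiler turns atomic_fetch_add into a
LOCK-prefixed instruction such as LOCK XADD.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the LOCK-prefixed increment: an atomic
   read-modify-write that returns the new value. */
static long inc_locked(atomic_long *p)
{
    return atomic_fetch_add(p, 1) + 1;
}

/* Hypothetical stand-in for the increment without LOCK: a plain load,
   add and store. volatile only stops the compiler from folding the
   loop away; it does not make the operation atomic. */
static long inc_unlocked(volatile long *p)
{
    long v = *p + 1;
    *p = v;
    return v;
}

/* Increment a fresh counter n times with the atomic version. */
static long run_locked(long n)
{
    atomic_long count = 0;
    for (long i = 0; i < n; ++i)
        inc_locked(&count);
    return atomic_load(&count);
}

/* Increment a fresh counter n times with the plain version. */
static long run_unlocked(long n)
{
    volatile long count = 0;
    for (long i = 0; i < n; ++i)
        inc_unlocked(&count);
    return count;
}

/* CPU seconds spent in run(n), measured with clock(). */
static double cpu_seconds(long (*run)(long), long n)
{
    clock_t start = clock();
    long result = run(n);
    clock_t stop = clock();
    assert(result == n); /* single-threaded, so both variants must agree */
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Print the time taken by each variant for n increments. */
static void print_timings(long n)
{
    printf("locked:   %.3f s\n", cpu_seconds(run_locked, n));
    printf("unlocked: %.3f s\n", cpu_seconds(run_unlocked, n));
}
```

Calling print_timings(100000000L) from main() prints one line per variant.
The ratio between the two depends on the CPU and on the compiler and its
flags, so it is worth checking the generated assembler before reading
anything into the numbers.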

Heiner

--
Jens-Heiner Rechtien
[EMAIL PROTECTED]
