On Mon, 2006-04-24 at 16:19 +0200, Jens-Heiner Rechtien wrote:
> Ross Johnson wrote:
> >
> > The timings for the Interlocked routine calling and for the inlined non-
> > locked asm using MSVC 6 were almost identical, whereas the inlined
> > locked asm was much slower. The same tests using GCC showed the
> > Interlocked calls to be similar to the slower locked asm from either GCC
> > or MSVC. The inlined non-locked asm for MSVC and GCC were very similar.
> > GCC may have been a little faster, reflecting that gcc can optimise the
> > inlined asm by substituting registers.
>
> Your measurements seem to suggest that Microsoft uses a conditional
> approach in the non-inlined version of the Interlocked[In|De]crement
> routines, without the lock prefix for older processors/single-processor
> systems. The additional check penalizes newer
> HT/multicore/multiprocessor systems; whether it matters at all needs to
> be measured.
I checked my notes and there isn't actually much difference between MSVC
and GCC when calling the Interlocked routines - both are faster than the
results using the lock prefix. So it isn't the compiler but the DLL
itself that appears to switch conditionally. I had also completely
forgotten that I emulate xchg using cmpxchg to avoid the mandatory bus
lock on that instruction.

The following results are for an application that performs saturated
pthreads reader-writer lock operations using pthreads-win32, which uses
xchg and cmpxchg inline assembler in the underlying mutex and semaphore
routines. The absolute values aren't important, but these times are
total milliseconds for 5 threads to perform 1,000,000 access operations
each (about 50,000 writes and 950,000 reads each), on a 2.4GHz i686
single-core, single-processor machine.

-------------------------------------------------------------------
          inlined        inlined    call
          PTW32 xchg     XCHG       InterlockedExchange
...................................................................
GCC          641          1687          765
VC6.0        844          1750          891
-------------------------------------------------------------------

As you say below, the XCHG instruction always locks the bus, so I
emulate it using the CMPXCHG instruction. The times for the emulated
xchg (inlined PTW32 xchg in the table) and the real XCHG instruction
(inlined XCHG) are shown above. The InterlockedExchange call timings
suggest that it also uses cmpxchg instead of xchg, and that it doesn't
use the lock prefix.

Even though the time spent in the xchg operation is a small proportion
of the whole application, the difference in overall run time with and
without the lock prefix (first two result columns) is a factor of 2 to
2.5.

> >
> > AFAIK, the xchg etc instructions are atomic without the lock prefix on
> > the single (non-hyperthreaded (TM)) processor system that I'm still
> > using.
>
> The Intel manual states that xchg implicitly behaves as if it had a
> lock prefix.

Yes. See above.
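The cmpxchg-based emulation described above can be sketched as a compare-and-swap loop. `exchange_via_cas` is a made-up name, and the GCC builtin `__sync_val_compare_and_swap` stands in for the inline cmpxchg assembler used in pthreads-win32; one caveat is that the builtin always emits a lock-prefixed cmpxchg, whereas the point of the inline-asm version is that the prefix can be omitted on uniprocessor builds.

```c
#include <assert.h>
#include <stdio.h>

/* Sketch of exchanging via cmpxchg instead of XCHG: XCHG with a memory
 * operand always asserts the bus lock (even without a LOCK prefix),
 * whereas CMPXCHG only locks when explicitly prefixed. Loop on
 * compare-and-swap until no other thread has changed the value between
 * our read and our swap. */
static long exchange_via_cas(volatile long *dest, long value)
{
    long old;
    do {
        old = *dest;  /* snapshot the current contents */
        /* swap in 'value' only if *dest still equals 'old' */
    } while (__sync_val_compare_and_swap(dest, old, value) != old);
    return old;       /* previous contents, matching XCHG semantics */
}

int main(void)
{
    volatile long v = 10;
    long prev = exchange_via_cas(&v, 42);
    printf("prev = %ld, now = %ld\n", prev, v);
    assert(prev == 10 && v == 42);
    return 0;
}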
> BTW, on newer processors (P4, Xeon etc) the "lock" prefix shouldn't be > that expensive, because if the target memory of the instruction is > cacheable the CPU will not assert the Lock# signal (which locks the bus) > but only lock the affected cache line. Otherwise, for some very specific multithreaded applications, a single processor can still beat two processors working together. :o) Ross --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]