On Mon, 2006-04-24 at 16:19 +0200, Jens-Heiner Rechtien wrote:
> Ross Johnson wrote:
> > 
> > The timings for the Interlocked routine calling and for the inlined non-
> > locked asm using MSVC 6 were almost identical, whereas the inlined
> > locked asm was much slower. The same tests using GCC showed the
> > Interlocked calls to be similar to the slower locked asm from either GCC
> > or MSVC. The inlined non-locked asm for MSVC and GCC were very similar.
> > GCC may have been a little faster, reflecting that gcc can optimise the
> > inlined asm by substituting registers.
> 
> Your measurements seem to suggest that Microsoft uses a conditional 
> approach in the non-inlined version of the Interlocked[In|De]crement 
> routines, without a lock prefix for older processors/single-processor 
> systems. The additional check penalizes newer 
> HT/multicore/multiprocessor systems; whether it matters at all needs 
> to be measured.

I checked my notes and there isn't actually much difference between MSVC
and GCC when calling the Interlocked routines - both are faster than the
results using the lock prefix. So it isn't the compiler but the DLL
itself that appears to switch conditionally.

I had also completely forgotten that I emulate xchg using cmpxchg to
avoid the mandatory bus lock on that instruction.
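
Roughly, the emulation looks like this (just a sketch, not the actual
pthreads-win32 code - the function name is made up): read the value,
then retry cmpxchg without a lock prefix until it succeeds, which on a
uniprocessor gives the same result as XCHG without asserting the bus
lock.

/* Emulate an atomic exchange with CMPXCHG so that no LOCK prefix (and
 * therefore no implicit bus lock) is used.  Only safe on a
 * single-processor, non-HT system, where CMPXCHG without LOCK is still
 * atomic with respect to other threads on the same CPU. */
static __inline long
fake_interlocked_exchange (volatile long *location, long value)
{
  long expected, observed;

  do
    {
      expected = *location;           /* snapshot the current value */
      __asm__ __volatile__ (
        "cmpxchgl %2, %1"             /* if (*location == eax)
                                           *location = value;
                                         eax = old *location */
        : "=a" (observed), "+m" (*location)
        : "r" (value), "0" (expected)
        : "cc", "memory");
    }
  while (observed != expected);       /* retry if it changed under us */

  return observed;                    /* the old value, as XCHG would */
}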

The following results are for an application that performs saturated
pthreads reader-writer lock operations using pthreads-win32, which uses
xchg and cmpxchg inline assembler in the underlying mutex and semaphore
routines.

The absolute numbers aren't important, but these times are total
milliseconds for 5 threads to perform 1,000,000 access operations each
(about 50,000 writes and 950,000 reads per thread), on a 2.4GHz
single-core, single-processor i686.

-------------------------------------------------------------------
                inlined         inlined         call
                PTW32 xchg      XCHG            InterlockedExchange
...................................................................
GCC             641             1687            765

VC6.0           844             1750            891
-------------------------------------------------------------------
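
For what it's worth, each thread in the test does something along these
lines (a sketch of the access pattern only, not the actual test
program):

#include <pthread.h>

#define OPS_PER_THREAD 1000000

static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
static long shared_data;

/* Each of the 5 threads does 1,000,000 lock/unlock cycles, with about
 * 1 write for every 19 reads (i.e. ~50,000 writes and ~950,000 reads
 * per thread), so the rwlock - and the xchg/cmpxchg underneath it - is
 * kept saturated. */
static void *
access_thread (void *arg)
{
  int i;
  long dummy = 0;

  for (i = 0; i < OPS_PER_THREAD; i++)
    {
      if (i % 20 == 0)
        {
          pthread_rwlock_wrlock (&rwlock);
          shared_data++;
          pthread_rwlock_unlock (&rwlock);
        }
      else
        {
          pthread_rwlock_rdlock (&rwlock);
          dummy = shared_data;
          pthread_rwlock_unlock (&rwlock);
        }
    }

  (void) dummy;
  (void) arg;
  return NULL;
}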

As you say below, the XCHG instruction always locks the bus, so I
emulate the XCHG instruction using the CMPXCHG instruction. The times
for the emulated xchg (inlined PTW32 xchg in the table) and the real
XCHG instruction (inlined XCHG) are shown above. The InterlockedExchange
call timings suggest that it also uses cmpxchg instead of xchg, and that
it doesn't use the lock prefix.
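
For comparison, the "inlined XCHG" column uses the real instruction,
something like this (again just a sketch): with a memory operand, XCHG
asserts the bus lock whether or not a LOCK prefix is written.

/* The straight XCHG version: with a memory operand, XCHG always
 * behaves as if it had a LOCK prefix, so the bus lock is unavoidable. */
static __inline long
inlined_xchg (volatile long *location, long value)
{
  __asm__ __volatile__ (
    "xchgl %0, %1"                    /* swap value with *location */
    : "+r" (value), "+m" (*location)
    :
    : "memory");

  return value;                       /* the previous *location */
}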

Even though the time spent in the xchg operation is a small proportion
of the whole application, the difference in overall run time with and
without the lock prefix (first two result columns) is a factor of 2 to
2.5.

> > 
> > AFAIK, the xchg etc instructions are atomic without the lock prefix on
> > the single (non-hyperthreaded (TM)) processor system that I'm still
> > using.
> 
> The Intel manual states that xchg implicitly behaves as if it had a 
> lock prefix.

Yes. See above.

> BTW, on newer processors (P4, Xeon etc) the "lock" prefix shouldn't be 
> that expensive, because if the target memory of the instruction is 
> cacheable the CPU will not assert the Lock# signal (which locks the bus) 
> but only lock the affected cache line.

Otherwise, for some very specific multithreaded applications, a single
processor can still beat two processors working together. :o)

Ross

