Hi Ross,

Thanks for your numbers. So it looks like the lock prefix inside the reference counters has an impact on older processors - exactly where it's not needed at all - which dwarfs even the cost of not inlining the reference counter. I'll have a look at it.
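
Just to illustrate the difference I mean (a sketch only, not the actual
reference counting code; GCC inline assembler, x86):

/* increment without the lock prefix: sufficient on a single-processor
 * system, no bus or cache-line lock is taken */
static inline void refcount_inc_unlocked(volatile int *count)
{
    __asm__ __volatile__ ("incl %0" : "+m" (*count) : : "memory", "cc");
}

/* increment with the lock prefix: required on SMP/HT systems, and the
 * expensive part on older processors */
static inline void refcount_inc_locked(volatile int *count)
{
    __asm__ __volatile__ ("lock; incl %0" : "+m" (*count) : : "memory", "cc");
}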

Heiner

Ross Johnson wrote:
On Mon, 2006-04-24 at 16:19 +0200, Jens-Heiner Rechtien wrote:
Ross Johnson wrote:
The timings for calling the Interlocked routines and for the inlined non-
locked asm using MSVC 6 were almost identical, whereas the inlined
locked asm was much slower. The same tests using GCC showed the
Interlocked calls to be similar to the slower locked asm from either GCC
or MSVC. The inlined non-locked asm for MSVC and GCC were very similar.
GCC may have been a little faster, reflecting that gcc can optimise the
inlined asm by substituting registers.
Your measurements seem to suggest that Microsoft uses a conditional approach in the non-inlined version of the Interlocked[In|De]crement routines, omitting the lock prefix for older processors and single-processor systems. The additional check penalizes newer HT/multicore/multiprocessor systems; whether that matters at all needs to be measured.

I checked my notes and there isn't actually much difference between MSVC
and GCC when calling the Interlocked routines - they are both faster
than results using the locked prefix. So it isn't the compiler but the
dll itself that appears to conditionally switch.
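
I imagine the dll does something roughly like this - purely my guess at
the shape of it, not the actual Microsoft code (GCC inline assembler,
x86, Win32 API for the processor count):

#include <windows.h>

static int g_num_cpus = 0;              /* 0 = not yet determined */

LONG guessed_InterlockedIncrement(volatile LONG *addend)
{
    LONG old = 1;

    if (g_num_cpus == 0) {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        g_num_cpus = (int) si.dwNumberOfProcessors;
    }

    if (g_num_cpus == 1) {
        /* uniprocessor: xadd is atomic enough without a bus lock */
        __asm__ __volatile__ ("xaddl %0,%1"
                              : "+r" (old), "+m" (*addend) : : "memory", "cc");
    } else {
        /* SMP/HT: the lock prefix is required for correctness */
        __asm__ __volatile__ ("lock; xaddl %0,%1"
                              : "+r" (old), "+m" (*addend) : : "memory", "cc");
    }
    return old + 1;                     /* new value, as the real call returns */
}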

I had also completely forgotten that I emulate xchg using cmpxchg to
avoid the mandatory bus lock on that instruction.
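
The emulation is essentially a compare-and-swap retry loop; something
like the following sketch (simplified, the real pthreads-win32 routine
differs in detail - and note there is no lock prefix, so this form is
only safe on a uniprocessor):

static inline long sketch_exchange(volatile long *dest, long value)
{
    long expected, prev;

    do {
        expected = *dest;       /* value we expect to still be there      */
        prev = expected;
        /* cmpxchg: if (*dest == eax) *dest = value, else eax = *dest     */
        __asm__ __volatile__ ("cmpxchgl %2,%1"
                              : "+a" (prev), "+m" (*dest)
                              : "r" (value)
                              : "memory", "cc");
    } while (prev != expected); /* someone changed it under us: retry     */

    return prev;                /* the old value, just as xchg would give */
}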

The following results are for an application that performs saturated
pthreads reader-writer lock operations using pthreads-win32, which uses
xchg and cmpxchg inline assembler in the underlying mutex and semaphore
routines.

It's not important, but these times are total milliseconds for 5 threads
to perform 1,000,000 access operations each (about 50,000 writes and
950,000 reads each). On a 2.4GHz i686 single-core, single processor.

-------------------------------------------------------------------
                inlined         inlined         call
                PTW32 xchg      XCHG            InterlockedExchange
...................................................................
GCC             641             1687            765

VC6.0           844             1750            891
-------------------------------------------------------------------
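
For what it's worth, each of the 5 threads runs a loop with roughly this
shape (a sketch of the method only, not the actual benchmark source;
main() starts 5 of these and times until they have all joined):

#include <pthread.h>

#define ACCESSES    1000000
#define WRITE_EVERY      20     /* ~50,000 writes, ~950,000 reads each */

static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
static volatile long shared_data = 0;

static void *worker(void *arg)
{
    int i;

    for (i = 0; i < ACCESSES; i++) {
        if (i % WRITE_EVERY == 0) {
            pthread_rwlock_wrlock(&rwlock);     /* writer */
            shared_data++;
            pthread_rwlock_unlock(&rwlock);
        } else {
            pthread_rwlock_rdlock(&rwlock);     /* reader */
            (void) shared_data;
            pthread_rwlock_unlock(&rwlock);
        }
    }
    return NULL;
}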

As you say below, the XCHG instruction always locks the bus, so I
emulate the XCHG instruction using the CMPXCHG instruction. The times
for the emulated xchg (inlined PTW32 xchg in the table) and the real
XCHG instruction (inlined XCHG) are shown above. The InterlockedExchange
call timings suggest that it also uses cmpxchg instead of xchg, and that
it doesn't use the lock prefix.

Even though the time spent in the xchg operation is a small proportion
of the whole application, the difference in overall run time with and
without the lock prefix (the first two result columns) is a factor of 2 to 2.5.

AFAIK, the xchg etc instructions are atomic without the lock prefix on
the single (non-hyperthreaded (TM)) processor system that I'm still
using.
The Intel manuals state that xchg implicitly behaves as if it had a lock prefix.

Yes. See above.

BTW, on newer processors (P4, Xeon, etc.) the "lock" prefix shouldn't be that expensive, because if the target memory of the instruction is cacheable the CPU will not assert the LOCK# signal (which locks the bus) but will only lock the affected cache line.

Otherwise, for some very specific multithreaded applications, a single
processor can still beat two processors working together. :o)

Ross


--
Jens-Heiner Rechtien
[EMAIL PROTECTED]
