Hi Ross,
thanks for your numbers. So it looks like, on older processors - exactly
where it isn't needed at all - the lock prefix inside the reference
counters has an impact which dwarfs even the cost of not inlining the
reference counter. I'll have a look at it.
Heiner
Ross Johnson wrote:
On Mon, 2006-04-24 at 16:19 +0200, Jens-Heiner Rechtien wrote:
Ross Johnson wrote:
With MSVC 6, the timings for calling the Interlocked routines and for the
inlined non-locked asm were almost identical, whereas the inlined locked
asm was much slower. The same tests using GCC showed the Interlocked
calls to be similar to the slower locked asm from either GCC or MSVC. The
inlined non-locked asm was very similar for MSVC and GCC; GCC may have
been a little faster, reflecting that gcc can optimise the inlined asm by
substituting registers.
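For concreteness, here is a minimal sketch (my own illustration, not the
pthreads-win32 source) of the three variants being timed, written as
GCC-style x86 inline asm; the function names are made up:

#include <windows.h>

/* Inlined increment with the lock prefix: safe on SMP, but pays for the
 * bus/cache-line lock on every operation. */
static __inline void refcount_incr_locked(volatile LONG *p)
{
    __asm__ __volatile__("lock; incl %0" : "+m" (*p) : : "memory", "cc");
}

/* Inlined increment without the lock prefix: only atomic with respect to
 * other threads on a single-processor system. */
static __inline void refcount_incr_nolock(volatile LONG *p)
{
    __asm__ __volatile__("incl %0" : "+m" (*p) : : "memory", "cc");
}

/* Non-inlined variant: call the routine exported by the system DLL. */
static __inline void refcount_incr_call(volatile LONG *p)
{
    InterlockedIncrement((LONG *)p);  /* cast only for older SDK prototypes */
}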
Your measurements seem to suggest that Microsoft uses a conditional
approach in the non-inlined version of the Interlocked[In|De]crement
routines, omitting the lock prefix on older processors/single-processor
systems. The additional check penalizes newer HT/multicore/multiprocessor
systems; whether it matters at all would need to be measured.
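One plausible shape for such a conditional switch - purely a guess at what
the DLL might do internally, not Microsoft's actual code - would be to test
the processor count once at startup and branch on it in every call:

#include <windows.h>

/* Set once at startup from the processor count reported by the system. */
static int single_processor = 0;

static void atomic_init(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    single_processor = (si.dwNumberOfProcessors == 1);
}

/* The speculated per-call check: skip the lock prefix on a uniprocessor,
 * pay an extra branch everywhere else. */
static void atomic_incr(volatile LONG *p)
{
    if (single_processor)
        __asm__ __volatile__("incl %0"       : "+m" (*p) : : "memory", "cc");
    else
        __asm__ __volatile__("lock; incl %0" : "+m" (*p) : : "memory", "cc");
}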
I checked my notes and there isn't actually much difference between MSVC
and GCC when calling the Interlocked routines - they are both faster
than results using the locked prefix. So it isn't the compiler but the
dll itself that appears to conditionally switch.
I had also completely forgotten that I emulate xchg using cmpxchg to
avoid the mandatory bus lock on that instruction.
The following results are for an application that performs saturated
pthreads reader-writer lock operations using pthreads-win32, which uses
xchg and cmpxchg inline assembler in the underlying mutex and semaphore
routines.
It's not important, but these times are total milliseconds for 5 threads
to perform 1,000,000 access operations each (about 50,000 writes and
950,000 reads each), on a 2.4GHz i686 single-core, single-processor
machine.
-------------------------------------------------------------------
           inlined          inlined        call to
           PTW32 xchg       XCHG           InterlockedExchange
...................................................................
GCC           641            1687               765
VC6.0         844            1750               891
-------------------------------------------------------------------
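For reference, a rough sketch of that kind of workload - my own
reconstruction, not the actual test program - could look like this:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 5
#define NOPS     1000000

static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
static long shared_counter = 0;

static void *worker(void *arg)
{
    long i;
    (void)arg;
    for (i = 0; i < NOPS; i++) {
        if (i % 20 == 0) {                  /* ~50,000 writes per thread */
            pthread_rwlock_wrlock(&rwlock);
            shared_counter++;
            pthread_rwlock_unlock(&rwlock);
        } else {                            /* ~950,000 reads per thread */
            pthread_rwlock_rdlock(&rwlock);
            (void)shared_counter;
            pthread_rwlock_unlock(&rwlock);
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    int i;
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter);
    return 0;
}

Wrap the create/join section in whatever timer you prefer to get the
total-milliseconds figures quoted in the table.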
As you say below, the XCHG instruction always locks the bus, so I
emulate the XCHG instruction using the CMPXCHG instruction. The times
for the emulated xchg (inlined PTW32 xchg in the table) and the real
XCHG instruction (inlined XCHG) are shown above. The InterlockedExchange
call timings suggest that it also uses cmpxchg instead of xchg, and that
it doesn't use the lock prefix.
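A minimal sketch of that emulation idea - my own illustration in GCC-style
inline asm, not the pthreads-win32 source - is a cmpxchg retry loop with
the lock prefix left off, which is only safe on a uniprocessor:

/* Exchange *p with newval and return the previous value, using cmpxchg in
 * a retry loop instead of xchg.  No lock prefix: only safe on a
 * single-processor system. */
static __inline long xchg_via_cmpxchg(volatile long *p, long newval)
{
    long oldval = *p;
    long prev;

    for (;;) {
        /* cmpxchg: if *p == eax (oldval) then *p = newval, else eax = *p */
        __asm__ __volatile__("cmpxchgl %2, %1"
                             : "=a" (prev), "+m" (*p)
                             : "r" (newval), "0" (oldval)
                             : "memory", "cc");
        if (prev == oldval)
            return prev;        /* swap succeeded */
        oldval = prev;          /* another thread changed *p first; retry */
    }
}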
Even though the time spent in the xchg operation is a small proportion of
the whole application, the difference in overall run time with and without
the bus lock (first two result columns) is a factor of 2 to 2.5. AFAIK,
the xchg etc. instructions are atomic without the lock prefix on the
single (non-hyperthreaded (TM)) processor system that I'm still using.
The Intel manuals state that xchg implicitly behaves as if it had a
lock prefix.
Yes. See above.
BTW, on newer processors (P4, Xeon etc) the "lock" prefix shouldn't be
that expensive, because if the target memory of the instruction is
cacheable the CPU will not assert the Lock# signal (which locks the bus)
but only lock the affected cache line.
Otherwise, for some very specific multithreaded applications, a single
processor can still beat two processors working together. :o)
Ross
--
Jens-Heiner Rechtien
[EMAIL PROTECTED]