On 09/30/12 18:18, Martin Sebor wrote:
I see you did a 64-bit build while I did a 32-bit one, so
I tried 64-bits. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?

The first thing that struck me in the cached `f' was that the __string_ref class uses a mutex to synchronize access to the ref counter. It turns out that on Linux/AMD64 we explicitly use a mutex instead of atomic ops on the ref counter, because of this block in rw/_config.h:

#  if _RWSTD_VER_MAJOR < 5
#    ifdef _RWSTD_OS_LINUX
       // on Linux/AMD64, unless explicitly requested, disable the use
       // of atomic operations in string for binary compatibility with
       // stdcxx 4.1.x
#      ifndef _RWSTD_USE_STRING_ATOMIC_OPS
#        define _RWSTD_NO_STRING_ATOMIC_OPS
#      endif   // _RWSTD_USE_STRING_ATOMIC_OPS
#    endif   // _WIN32
#  endif   // stdcxx < 5.0


That is not the cause of the performance difference, though. Even after building with _RWSTD_USE_STRING_ATOMIC_OPS defined, the non-cached version is still the faster one.
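
To make the mutex-versus-atomic distinction concrete, here is a minimal sketch of the two ways a shared reference count can be updated. It is not the stdcxx implementation: MutexRef, AtomicRef and hammer are hypothetical names made up for illustration, and the sketch assumes a C++11 compiler rather than the library's own atomic primitives.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a ref counter guarded by a mutex
// (the strategy selected when _RWSTD_NO_STRING_ATOMIC_OPS is defined).
struct MutexRef {
    std::mutex mtx;
    long refs = 0;

    long add_ref() {
        std::lock_guard<std::mutex> lock(mtx);  // lock, bump, unlock
        return ++refs;
    }
    long get() {
        std::lock_guard<std::mutex> lock(mtx);
        return refs;
    }
};

// Hypothetical stand-in for a ref counter updated with atomic ops.
struct AtomicRef {
    std::atomic<long> refs{0};

    long add_ref() {
        // a single atomic read-modify-write, no lock involved
        return refs.fetch_add(1, std::memory_order_relaxed) + 1;
    }
    long get() const { return refs.load(); }
};

// Increment the counter from several threads and return the final count.
template <class Ref>
long hammer(Ref& ref, int nthreads, int iters) {
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&ref, iters] {
            for (int i = 0; i < iters; ++i)
                ref.add_ref();
        });
    for (auto& th : pool)
        th.join();
    return ref.get();
}
```

Calling hammer() on a MutexRef and on an AtomicRef with the same thread and iteration counts should yield the same final count; only the cost of each increment differs. The sketch says nothing about why the cached numpunct path measures slower here.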

Liviu
