I haven't had time to look at this since my last email on
Sunday. I also forgot about the string mutex. I don't think
I'll have time to spend on this until later in the week.
Unless the disassembly reveals the smoking gun, I think we
might need to simplify the test to get to the bottom of the
differences in our measurements. (I.e., eliminate the library
and measure the runtime of a simple thread loop, with and
without locking.) We should also look at the GLIBC and
kernel versions on our systems, on the off chance that
there has been a change that could explain the discrepancy
between my numbers and yours. I suspect my system (RHEL 4.8)
is much older than yours (I don't remember now if you posted
your details).
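
For reference, the simplified test could look roughly like the sketch below. It is not the test program from this thread: run_loop and g_mutex are made-up names, and it assumes a C++11 compiler (std::thread, std::mutex, std::chrono) rather than the pthreads calls the original test presumably uses. Each thread increments a private counter, optionally taking one shared mutex around every increment, and the function returns the elapsed wall-clock time.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared mutex; all threads contend on it when `lock` is true.
static std::mutex g_mutex;

// Run `nthreads` threads, each performing `iters` increments of a private
// counter, with or without taking g_mutex around each increment.
// Returns elapsed wall-clock seconds; the sum of all counters is stored
// in *total so the caller can check that every iteration ran.
double run_loop(unsigned nthreads, unsigned long iters, bool lock,
                unsigned long *total)
{
    assert(total != nullptr);

    std::vector<unsigned long> counts(nthreads, 0);
    std::vector<std::thread> threads;

    const auto start = std::chrono::steady_clock::now();

    for (unsigned t = 0; t != nthreads; ++t) {
        threads.emplace_back([&counts, t, iters, lock] {
            // volatile keeps the compiler from collapsing the loop
            volatile unsigned long n = 0;
            for (unsigned long i = 0; i != iters; ++i) {
                if (lock) {
                    std::lock_guard<std::mutex> guard(g_mutex);
                    n = n + 1;
                } else {
                    n = n + 1;
                }
            }
            counts[t] = n;   // each thread writes only its own slot
        });
    }

    for (auto &th : threads)
        th.join();

    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    *total = 0;
    for (unsigned long c : counts)
        *total += c;

    return elapsed.count();
}
```

Calling run_loop once with lock set to true and once with it set to false, using the same thread and iteration counts on each of our systems, would show how much of the measured time is down to the mutex alone. It needs to be linked with the threads library (e.g., -pthread with gcc).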

Martin

On 10/02/2012 06:22 AM, Liviu Nicoara wrote:
On 09/30/12 18:18, Martin Sebor wrote:
I see you did a 64-bit build while I did a 32-bit one, so
I tried 64 bits. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?

The first thing that struck me in the cached `f' was that the __string_ref
class uses a mutex for synchronizing access to the ref counter. It turns
out, for Linux on AMD64 we explicitly use a mutex instead of the atomic
ops on the ref counter, via a block in rw/_config.h:

# if _RWSTD_VER_MAJOR < 5
#   ifdef _RWSTD_OS_LINUX
      // on Linux/AMD64, unless explicitly requested, disable the use
      // of atomic operations in string for binary compatibility with
      // stdcxx 4.1.x
#     ifndef _RWSTD_USE_STRING_ATOMIC_OPS
#       define _RWSTD_NO_STRING_ATOMIC_OPS
#     endif // _RWSTD_USE_STRING_ATOMIC_OPS
#   endif // _WIN32
# endif // stdcxx < 5.0


That is not the cause of the performance difference, though. Even after
building with _RWSTD_USE_STRING_ATOMIC_OPS I get the same better
performance with the non-cached version.
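
To make the two alternatives concrete, here is a rough sketch of a reference counter guarded by a mutex versus one using atomic operations. This is not the actual __string_ref code: the class name ref_count and the macro SKETCH_NO_ATOMIC_OPS are invented for illustration (the macro only plays the role that _RWSTD_NO_STRING_ATOMIC_OPS plays in the config block), and it assumes C++11 <atomic> and <mutex> instead of the library's own primitives.

```cpp
#include <atomic>
#include <mutex>

// SKETCH_NO_ATOMIC_OPS is a hypothetical switch; define it to select
// the mutex-guarded variant instead of the atomic one.

#ifdef SKETCH_NO_ATOMIC_OPS

// Mutex variant: every update of the counter takes a lock.
class ref_count {
    long       count_;
    std::mutex mutex_;
public:
    ref_count() : count_(1) {}

    // Both functions return the counter's new value.
    long increment() {
        std::lock_guard<std::mutex> guard(mutex_);
        return ++count_;
    }
    long decrement() {
        std::lock_guard<std::mutex> guard(mutex_);
        return --count_;
    }
};

#else

// Atomic variant: updates are single atomic read-modify-write operations.
class ref_count {
    std::atomic<long> count_;
public:
    ref_count() : count_(1) {}

    // fetch_add/fetch_sub return the old value, so adjust to the new one.
    long increment() { return count_.fetch_add(1) + 1; }
    long decrement() { return count_.fetch_sub(1) - 1; }
};

#endif
```

Both variants expose the same interface, so a test can be built either way and timed without other changes.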

Liviu
