On 10/03/2012 07:01 AM, Liviu Nicoara wrote:
On 10/02/12 10:41, Martin Sebor wrote:
I haven't had time to look at this since my last email on
Sunday. I also forgot about the string mutex. I don't think
I'll have time to spend on this until later in the week.
Unless the disassembly reveals the smoking gun, I think we
might need to simplify the test to get to the bottom of the
differences in our measurements. (I.e., eliminate the library
and measure the runtime of a simple thread loop, with and
without locking.) We should also look at the GLIBC and
kernel versions on our systems, on the off chance that
there has been a change that could explain the discrepancy
between my numbers and yours. I suspect my system (RHEL 4.8)
is much older than yours (I don't remember now if you posted
your details).

I am gathering some more measurements along these lines but it's
time-consuming. I estimate I will have some ready for review later today or
tomorrow. In the meantime could you please post your kernel, glibc and
compiler versions?

I was just thinking of a few simple loops along the lines of:

  void* thread_func (void*) {
      for (int i = 0; i < N; ++i) {
          // test 1: do some simple stuff inline
          // test 2: call a virtual function to do the same stuff
          // test 3: lock and unlock a mutex and do the same stuff
      }
      return 0;
  }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what "simple stuff" is (eventually, even when
it's getting numpunct::grouping() data).
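
A minimal sketch of the three loops, library-free, might look like the
following. It assumes a C++11 compiler (std::thread and std::mutex rather
than raw pthreads), and every name in it (Work, TestKind, thread_body,
run_test) is hypothetical. The "simple stuff" is just a counter increment,
made volatile so the optimizer cannot collapse the inline loop.

```cpp
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the "simple stuff": increment a counter.
struct Work {
    virtual ~Work () {}
    virtual void do_virtual (volatile unsigned long &n) { ++n; }
};

enum TestKind { test_inline = 1, test_virtual = 2, test_mutex = 3 };

static std::mutex work_mutex;   // shared by all threads in test 3

// Body of one thread: run `iters` iterations of the chosen variant.
void thread_body (TestKind kind, unsigned long iters,
                  Work *work, unsigned long *result)
{
    volatile unsigned long n = 0;   // volatile keeps the loop from folding
    for (unsigned long i = 0; i < iters; ++i) {
        switch (kind) {
        case test_inline:               // test 1: inline work
            ++n;
            break;
        case test_virtual:              // test 2: same work via virtual call
            work->do_virtual (n);
            break;
        case test_mutex: {              // test 3: same work under a lock
            std::lock_guard<std::mutex> guard (work_mutex);
            ++n;
            break;
        }
        }
    }
    *result = n;
}

struct Result { double seconds; unsigned long total; };

// Run one variant on `nthreads` threads and report wall-clock time
// plus the summed per-thread counters (a sanity check on the loop).
Result run_test (TestKind kind, unsigned nthreads, unsigned long iters)
{
    Work work;
    std::vector<unsigned long> counts (nthreads, 0);
    std::vector<std::thread> threads;

    auto start = std::chrono::steady_clock::now ();
    for (unsigned t = 0; t < nthreads; ++t)
        threads.emplace_back (thread_body, kind, iters, &work, &counts[t]);
    for (auto &th : threads)
        th.join ();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now () - start;

    unsigned long total = 0;
    for (unsigned long c : counts)
        total += c;
    return Result { elapsed.count (), total };
}
```

Calling run_test (test_inline, 16, N), then the same with test_virtual and
test_mutex, and comparing the seconds fields gives the three numbers. The
thread count and N are left to the caller since they depend on the box.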

For the Linux tests I used a 16 CPU (Xeon X5570 @ 3GHz) box with
RHEL 4.8 with 2.6.9-89.0.11.ELlargesmp, GLIBC version is 2.3.4,
and GCC 3.4.6.




On 10/02/2012 06:22 AM, Liviu Nicoara wrote:
On 09/30/12 18:18, Martin Sebor wrote:
I see you did a 64-bit build while I did a 32-bit one, so
I tried 64-bits. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?

The first thing that struck me in the cached `f' was that the __string_ref
class uses a mutex for synchronizing access to the ref counter. It turns
out that for Linux on AMD64 we explicitly use a mutex instead of the atomic
ops on the ref counter, via a block in rw/_config.h:

// on Linux/AMD64, unless explicitly requested, disable the use
// of atomic operations in string for binary compatibility with
// stdcxx 4.1.x
# endif // _WIN32
# endif // stdcxx < 5.0

That is not the cause of the performance difference, though. Even after
building with __RWSTD_USE_STRING_ATOMIC_OPS I still get better
performance with the non-cached version.
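
To illustrate the two approaches being compared, here is a rough sketch of
a reference counter guarded by a mutex next to one that uses atomic
operations. This is not the stdcxx __string_ref code; the class names are
hypothetical and it assumes C++11 <atomic> and <mutex>.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical ref counter that takes a lock on every update.
class MutexRefCount {
    std::mutex mutex_;
    long       count_;
public:
    MutexRefCount () : count_ (1) {}
    long add_ref () {
        std::lock_guard<std::mutex> guard (mutex_);
        return ++count_;
    }
    long release () {
        std::lock_guard<std::mutex> guard (mutex_);
        return --count_;
    }
};

// Hypothetical ref counter using atomic read-modify-write operations.
class AtomicRefCount {
    std::atomic<long> count_;
public:
    AtomicRefCount () : count_ (1) {}
    long add_ref () {
        // fetch_add returns the old value, so add 1 for the new count
        return count_.fetch_add (1, std::memory_order_relaxed) + 1;
    }
    long release () {
        // acq_rel so the thread that drops the last ref sees prior writes
        return count_.fetch_sub (1, std::memory_order_acq_rel) - 1;
    }
};
```

Both classes expose the same interface, so a test can swap one for the
other to see how much of the runtime the locking accounts for.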

