Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-21 Thread Liviu Nicoara

On 10/03/12 11:10, Martin Sebor wrote:

[...]
I was just thinking of a few simple loops along the lines of:



tl;dr: I consider the results of the multi-threaded performance tests (12S, 
Intel/AMD multicores) as coming from heavy contention in copying of 
reference-counted std::string objects.

The long version: I went back to the original test case (attached, t.cpp). As a 
reminder, the program is a simplification of the library numpunct facet 
creation process, which can be compiled either with data caching enabled or 
disabled.

I ran the perf tool (https://perf.wiki.kernel.org/index.php/Main_Page) on the 
program, and I noticed marked differences in the number of instructions and 
cycles (an order of magnitude) and also in the number of CPU-migrations and 
context-switches (way larger for the cached version):

 Performance counter stats for './t':

  121168.026301  task-clock-msecs #  3.778 CPUs
6133377  context-switches #  0.051 M/sec
  35457  CPU-migrations   #  0.000 M/sec
298  page-faults  #  0.000 M/sec
   338006437277  cycles   #   2789.568 M/sec
   136554855096  instructions #  0.404 IPC
31298346444  branches #258.305 M/sec
 2305761092  branch-misses#  7.367 %
 1592157186  cache-references # 13.140 M/sec
1729755  cache-misses #  0.014 M/sec

   32.070381363  seconds time elapsed

vs.

 Performance counter stats for './t':

   16194.235282  task-clock-msecs #  3.883 CPUs
   7491  context-switches #  0.000 M/sec
 47  CPU-migrations   #  0.000 M/sec
311  page-faults  #  0.000 M/sec
45210281987  cycles   #   2791.752 M/sec
76784904688  instructions #  1.698 IPC
18891711522  branches #   1166.570 M/sec
 3162251552  branch-misses# 16.739 %
3332839  cache-references #  0.206 M/sec
 119424  cache-misses #  0.007 M/sec

4.170904279  seconds time elapsed


Then, I have run more detailed reports on the runs. I noticed that the report 
for the cached version contains what appears to be thread synchronization 
overhead, whereas the non-cached version doesn't:

# Events: 134K cycles
#
# Overhead  Command   Shared Object   Symbol
#   ...  ..  ...
#
50.69%t  [kernel.kallsyms]   [k] _raw_spin_lock
 5.46%t  libpthread-2.13.so  [.] pthread_mutex_lock
 4.16%t  libpthread-2.13.so  [.] __pthread_mutex_unlock
 3.65%t  libpthread-2.13.so  [.] __lll_lock_wait
 3.35%t  [kernel.kallsyms]   [k] copy_user_generic_string
 3.09%t  40395b  [.]   40395b
 2.91%t  [kernel.kallsyms]   [k] hash_futex
 2.87%t  [kernel.kallsyms]   [k] futex_wake
 2.67%t  t   [.] f
 2.37%t  [kernel.kallsyms]   [k] futex_wait_setup
 1.61%t  [kernel.kallsyms]   [k] system_call
 1.00%t  [kernel.kallsyms]   [k] system_call_after_swapgs
 0.98%t  [kernel.kallsyms]   [k] __switch_to
 0.96%t  libpthread-2.13.so  [.] __lll_unlock_wake
 0.95%t  [kernel.kallsyms]   [k] get_futex_key
 0.92%t  libc-2.13.so[.] __strlen_sse42
 [...]


vs.


# Events: 15K cycles
#
# Overhead  Command   Shared Object 

 Symbol
#   ...  ..  
..
#
49.55%   :295717f1ed3830a86  [.] 7f1ed3830a86
 8.67%t  libc-2.13.so[.] __malloc
 8.03%t  libc-2.13.so[.] cfree
 4.88%t  t   [.] std::basic_stringchar, std::char_traitschar, 
std::allocatorchar ::basic_string(char const*, std::allocatorchar const)
 4.67%t  libc-2.13.so[.] __strlen_sse42
 4.14%t  libpthread-2.13.so  [.] __pthread_mutex_unlock
 3.62%t  libpthread-2.13.so  [.] pthread_mutex_lock
 2.64%t  t   [.] f
 2.40%t  libc-2.13.so[.] _int_malloc
 2.10%t  libc-2.13.so[.] _int_free
 1.53%t  libpthread-2.13.so  [.] __pthread_mutex_init_internal
 1.33%t  t   [.] operator new(unsigned long)
 1.08%t  t   [.] X::numpunct::do_grouping() const
 1.04%t  libc-2.13.so[.] __memcpy_sse2
 

Re: [PATCH] STDCXX-1073

2012-10-21 Thread Martin Sebor

The library patch seems correct but the test case in the issue
and the new test aren't strictly conforming. The only requirement
on collate::tranform() is that it return...

A basic_stringcharT value that, compared lexicographically
with the result of calling transform() on another string,
yields the same result as calling do_compare() on the same
two strings.

There's no requirement that embedded NULs must be preserved
(that's just how it happens to be implemented). I find it best
to avoid relying on the knowledge of implementation details
when exercising the library so that tests don't start failing
after a conforming optimization or some such tweak is added
to the code.

Attached is a test case that reproduces the bug without relying
on implementation details. The test passes with your patch,
confirming that it's good. This test still makes the assumption
that strings that lexicographically distinct strings compare
unequal in the en_US locale, but I think that's a safe assumption
regardless of the encoding (though only for the hardcoded strings).

Martin

On 10/18/2012 04:28 PM, Liviu Nicoara wrote:

On 10/18/12 11:02, Martin Sebor wrote:

On 10/16/2012 10:38 AM, Liviu Nicoara wrote:

I have enhanced the collation test, 22.locale.collate.cpp with a bunch
of cases that deal with embedded strings, inside the input strings as
well as at the edges. More defects became apparent, and they have been
fixed.

I have re-worked the src/collate.cpp patch and I am attaching it here.
All collation tests (old and new) pass. If there are no objections, I
will check it in later in the week.


There are tabs in the patch (we need spaces). They also make
the patch harder to review than it otherwise would be. I would
also suggest keeping the MSVC  8.0 block and cleaning things
up separately. That will also make the patch easier to review.
(I'm so swamped that this matters to me.)


I looked at both (latest) patches and they do not have tabs in them.
They were sent in my last post on the 16th, the one you replied to. The
diff without -w is quite a bit verbose because I moved a whole block in
both collate.cpp functions, inside an if, so there are whitespace
differences. svn diff -w gave me a quite decent view of the changes
without having to looks at the actual patched source file.

I am re-attaching the patches, this time created without -w so they
would be a bit verbose. No tabs this time either.



I haven't looked at the test patch yet. If you could clean up
the tab issue and reduce the amount of unnecessary whitespace
changes I promise to review it this weekend.


The test enhancements are pretty plain, just a bunch of added stuff.

Thanks,
Liviu



#include cassert
#include cstdio
#include locale

int main () {
const std::locale loc (en_US);
typedef std::collatechar C;
const C col = std::use_facetC(loc);

static const struct {
const char *a;
unsigned n1;
const char *b;
unsigned n2;
} tests[] = {
#define T(s, t) { s, sizeof s  - 1, t, sizeof t - 1 }

T (, ),
T (, \0),
T (, \0\0),
T (\0, ),
T (\0, \0),
T (\0, \0\0),
T (a, \0),
T (a, \0a),
T (a, a\0),
T (a, a\0\0),
T (a\0, a),
T (a\0, a\0),
T (a\0, a\0\0),
T (\0a, ),
T (\0a, \0),
T (\0a, \0a),
T (\0a, \0a\0),
T (a\0\0b, ),
T (a\0\0b, a),
T (a\0\0b, ab),
T (a\0\0b, a\0),
T (a\0\0b, a\0\0),
T (a\0\0b, a\0b),
T (a\0\0b, a\0\0b),
};

for (unsigned i = 0; i != sizeof tests / sizeof *tests; ++i) {
const char* const a = tests [i].a;
unsigned n1 = tests [i].n1;
const char* const b = tests [i].b;
unsigned n2 = tests [i].n2;

const std::string x1 = col.transform (a, a + n1);
const std::string x2 = col.transform (b, b + n2);

const int colcmp = col.compare (a, a + n1, b, b + n2);
  int lexcmp = x1.compare (x2);
lexcmp = lexcmp  -1 ? -1 : 1  lexcmp ? 1 : lexcmp;

std::printf (%u: %i == %i\n, i, lexcmp, colcmp);

assert (colcmp == lexcmp);
if (colcmp)
assert (std::string (a, a + n1) != std::string (b, b + n2));
else
assert (std::string (a, a + n1) == std::string (b, b + n2));
}
}


Re: [PATCH] STDCXX-1073

2012-10-21 Thread Liviu Nicoara

On 10/21/12 19:08, Martin Sebor wrote:

There's no requirement that embedded NULs must be preserved
(that's just how it happens to be implemented). I find it best
to avoid relying on the knowledge of implementation details
when exercising the library so that tests don't start failing
after a conforming optimization or some such tweak is added
to the code.


That's right. I'll improve it.

Thanks for the review.

Liviu


std::string reference counting performance

2012-10-21 Thread Liviu Nicoara

Hi all,

I ran the attached test case, s.cpp, and timed it:

* Deep copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

16  0m39.292s   2m34.381s   0m0.035s
8   0m20.573s   1m18.131s   0m0.025s
6   0m15.037s   0m58.269s   0m0.009s
4   0m9.973s0m38.657s   0m0.009s
2   0m8.750s0m17.473s   0m0.016s
1   0m8.499s0m8.495s0m0.002s

* Shallow copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

16  5m29.311s   4m2.107s17m39.634s
8   2m43.635s   2m1.823s8m50.454s
6   2m10.601s   2m2.316s6m18.462s
4   1m20.766s   1m35.383s   3m29.467s
2   0m25.146s   0m27.775s   0m22.408s
1   0m3.557s0m3.555s0m0.000s

The results show a counter-intuitive better performance for deep-copied 
strings. Again, there is the reversal of performance for one-threaded 
programs where there is no contention.


Liviu
#include iostream
#include locale

#include cstdio
#include cstdlib
#include cstring

#include pthread.h
#include unistd.h

#define MAX_THREADS 128

static long nloops = 1000, nthreads = 16;
static bool volatile pwait = true;



extern C {

static void* 
f (void* pv)
{
std::string const s = *reinterpret_cast std::string*  (pv);

volatile unsigned long n = 0;

while (pwait) ;

for (int i = 0; i  nloops; ++i) {
#if !defined (NO_DEEP_COPY)
const std::string tmp (s.c_str ());
#else
const std::string tmp = s;
#endif // NO_DEEP_COPY

n += strlen (tmp.c_str ());
}

return (void*)n;
}

} // extern C

int
main (int argc, char** argv)
{
switch (argc) {
case 3:
nloops = atol (argv [2]);
case 2:
nthreads = atol (argv [1]);
break;
}

pthread_t tid [MAX_THREADS] = { 0 };

if (nthreads  MAX_THREADS)
nthreads = MAX_THREADS;

printf (%ld, %ld\n, nthreads, nloops);

pthread_setconcurrency (nthreads);

std::string s (Hello, world.);

for (int i = 0; i  nthreads; ++i) {
if (pthread_create (tid + i, 0, f, s))
exit (-1);
}

usleep (50);
pwait = false;

for (int i = 0; i  nthreads; ++i) {
if (tid [i])
pthread_join (tid [i], 0);
}

return 0;
}


* Deep-copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

16  0m39.292s   2m34.381s   0m0.035s
8   0m20.573s   1m18.131s   0m0.025s
6   0m15.037s   0m58.269s   0m0.009s
4   0m9.973s0m38.657s   0m0.009s
2   0m8.750s0m17.473s   0m0.016s
1   0m8.499s0m8.495s0m0.002s

* Ref-counted

16  5m29.311s   4m2.107s17m39.634s
8   2m43.635s   2m1.823s8m50.454s
6   2m10.601s   2m2.316s6m18.462s
4   1m20.766s   1m35.383s   3m29.467s
2   0m25.146s   0m27.775s   0m22.408s
1   0m3.557s0m3.555s0m0.000s


Re: std::string reference counting performance

2012-10-21 Thread Liviu Nicoara

On 10/21/12 21:11, Liviu Nicoara wrote:

Hi all,

I ran the attached test case, s.cpp, and timed it:


Gosh darn it, attached the wrong files. Here are the right ones.

Liviu


* Deep-copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

16  0m39.292s   2m34.381s   0m0.035s
8   0m20.573s   1m18.131s   0m0.025s
6   0m15.037s   0m58.269s   0m0.009s
4   0m9.973s0m38.657s   0m0.009s
2   0m8.750s0m17.473s   0m0.016s
1   0m8.499s0m8.495s0m0.002s

* Ref-counted

16  5m29.311s   4m2.107s17m39.634s
8   2m43.635s   2m1.823s8m50.454s
6   2m10.601s   2m2.316s6m18.462s
4   1m20.766s   1m35.383s   3m29.467s
2   0m25.146s   0m27.775s   0m22.408s
1   0m3.557s0m3.555s0m0.000s
#include iostream
#include locale

#include cstdio
#include cstdlib
#include cstring

#include pthread.h
#include unistd.h

#define MAX_THREADS 128

static long nloops = 1000, nthreads = 16;
static bool volatile pwait = true;



extern C {

static void* 
f (void* pv)
{
std::string const s = *reinterpret_cast std::string*  (pv);

volatile unsigned long n = 0;

while (pwait) ;

for (int i = 0; i  nloops; ++i) {
#if !defined (NO_DEEP_COPY)
const std::string tmp (s.c_str ());
#else
const std::string tmp = s;
#endif // NO_DEEP_COPY

n += strlen (tmp.c_str ());
}

return (void*)n;
}

} // extern C

int
main (int argc, char** argv)
{
switch (argc) {
case 3:
nloops = atol (argv [2]);
case 2:
nthreads = atol (argv [1]);
break;
}

pthread_t tid [MAX_THREADS] = { 0 };

if (nthreads  MAX_THREADS)
nthreads = MAX_THREADS;

printf (%ld, %ld\n, nthreads, nloops);

pthread_setconcurrency (nthreads);

std::string s (Hello, world.);

for (int i = 0; i  nthreads; ++i) {
if (pthread_create (tid + i, 0, f, s))
exit (-1);
}

usleep (50);
pwait = false;

for (int i = 0; i  nthreads; ++i) {
if (tid [i])
pthread_join (tid [i], 0);
}

return 0;
}



Re: std::string reference counting performance

2012-10-21 Thread Liviu Nicoara

On 10/21/12 21:44, Liviu Nicoara wrote:

On 10/21/12 21:11, Liviu Nicoara wrote:

Hi all,

I ran the attached test case, s.cpp, and timed it:


Gosh darn it, attached the wrong files. Here are the right ones.


Made the same mistake twice. My apologies for the clutter. Here are the 
right ones.


Liviu

* Deep-copy

$ for f in 16 8 4 2 1; do time ./s $f 5000; done
16, 50000m9.438s2m15.615s   0m10.799s
 8, 50000m6.973s0m55.579s   0m0.006s
 4, 50000m6.942s0m27.505s   0m0.001s
 2, 50000m6.888s0m13.753s   0m0.002s
 1, 50000m6.800s0m6.799s0m0.001s

* Ref-counted

$ for f in 16 8 4 2 1; do time ./s $f 5000; done
16, 50002m19.098s   2m36.278s   30m38.488s
 8, 50001m47.757s   2m7.278s10m18.418s
 4, 50004m27.425s   5m56.620s   10m13.892s
 2, 50000m17.667s   0m22.560s   0m10.695s
 1, 50000m2.923s0m2.921s0m0.001s

#include iostream
#include locale

#include cstdio
#include cstdlib
#include cstring

#include pthread.h
#include unistd.h

#define MAX_THREADS 128

static long nloops = 1000, nthreads = 16;
static bool volatile pwait = true;



extern C {

static void* 
f (void* pv)
{
std::string const s = *reinterpret_cast std::string*  (pv);

volatile unsigned long n = 0;

while (pwait) ;

for (int i = 0; i  nloops; ++i) {
#if !defined (NO_DEEP_COPY)
const std::string tmp (s.c_str (), s.size ());
#else
const std::string tmp = s;
#endif // NO_DEEP_COPY

n += strlen (tmp.c_str ());
}

return (void*)n;
}

} // extern C

int
main (int argc, char** argv)
{
switch (argc) {
case 3:
nloops = atol (argv [2]);
case 2:
nthreads = atol (argv [1]);
break;
}

pthread_t tid [MAX_THREADS] = { 0 };

if (nthreads  MAX_THREADS)
nthreads = MAX_THREADS;

printf (%ld, %ld\n, nthreads, nloops);

pthread_setconcurrency (nthreads);

std::string s (Hello, world.);

for (int i = 0; i  nthreads; ++i) {
if (pthread_create (tid + i, 0, f, s))
exit (-1);
}

usleep (50);
pwait = false;

for (int i = 0; i  nthreads; ++i) {
if (tid [i])
pthread_join (tid [i], 0);
}

return 0;
}




Re: std::string reference counting performance

2012-10-21 Thread Martin Sebor

This is a known problem. The plan is to change the default
string implementation in thread-safe configurations in 5.0
from reference counted to deep copying. Until then, it's
a library configuration option that must be explicitly
enabled. There should be a Jira task to make this change.

Martin

On 10/21/2012 07:11 PM, Liviu Nicoara wrote:

Hi all,

I ran the attached test case, s.cpp, and timed it:

* Deep copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

160m39.292s2m34.381s0m0.035s
80m20.573s1m18.131s0m0.025s
60m15.037s0m58.269s0m0.009s
40m9.973s0m38.657s0m0.009s
20m8.750s0m17.473s0m0.016s
10m8.499s0m8.495s0m0.002s

* Shallow copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

165m29.311s4m2.107s17m39.634s
82m43.635s2m1.823s8m50.454s
62m10.601s2m2.316s6m18.462s
41m20.766s1m35.383s3m29.467s
20m25.146s0m27.775s0m22.408s
10m3.557s0m3.555s0m0.000s

The results show a counter-intuitive better performance for deep-copied
strings. Again, there is the reversal of performance for one-threaded
programs where there is no contention.

Liviu