Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/03/12 11:10, Martin Sebor wrote:
> [...] I was just thinking of a few simple loops along the lines of:

tl;dr: I attribute the results of the multi-threaded performance tests
(12S, Intel/AMD multicores) to heavy contention in the copying of
reference-counted std::string objects.

The long version: I went back to the original test case (attached, t.cpp).
As a reminder, the program is a simplification of the library numpunct
facet creation process, and it can be compiled with data caching either
enabled or disabled.

I ran the perf tool (https://perf.wiki.kernel.org/index.php/Main_Page) on
the program and noticed marked differences in the numbers of instructions
and cycles (an order of magnitude), and also in the numbers of
CPU-migrations and context-switches (far larger for the cached version):

 Performance counter stats for './t':

    121168.026301  task-clock-msecs    #      3.778 CPUs
          6133377  context-switches    #      0.051 M/sec
            35457  CPU-migrations      #      0.000 M/sec
              298  page-faults         #      0.000 M/sec
     338006437277  cycles              #   2789.568 M/sec
     136554855096  instructions        #      0.404 IPC
      31298346444  branches            #    258.305 M/sec
       2305761092  branch-misses       #      7.367 %
       1592157186  cache-references    #     13.140 M/sec
          1729755  cache-misses        #      0.014 M/sec

     32.070381363  seconds time elapsed

vs.

 Performance counter stats for './t':

     16194.235282  task-clock-msecs    #      3.883 CPUs
             7491  context-switches    #      0.000 M/sec
               47  CPU-migrations      #      0.000 M/sec
              311  page-faults         #      0.000 M/sec
      45210281987  cycles              #   2791.752 M/sec
      76784904688  instructions        #      1.698 IPC
      18891711522  branches            #   1166.570 M/sec
       3162251552  branch-misses       #     16.739 %
          3332839  cache-references    #      0.206 M/sec
           119424  cache-misses        #      0.007 M/sec

      4.170904279  seconds time elapsed

Then I ran more detailed reports on the two binaries. The report for the
cached version contains what appears to be thread synchronization
overhead, whereas the report for the non-cached version does not:

# Events: 134K cycles
#
# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  ......
#
    50.69%  t        [kernel.kallsyms]   [k] _raw_spin_lock
     5.46%  t        libpthread-2.13.so  [.] pthread_mutex_lock
     4.16%  t        libpthread-2.13.so  [.] __pthread_mutex_unlock
     3.65%  t        libpthread-2.13.so  [.] __lll_lock_wait
     3.35%  t        [kernel.kallsyms]   [k] copy_user_generic_string
     3.09%  t        40395b              [.] 40395b
     2.91%  t        [kernel.kallsyms]   [k] hash_futex
     2.87%  t        [kernel.kallsyms]   [k] futex_wake
     2.67%  t        t                   [.] f
     2.37%  t        [kernel.kallsyms]   [k] futex_wait_setup
     1.61%  t        [kernel.kallsyms]   [k] system_call
     1.00%  t        [kernel.kallsyms]   [k] system_call_after_swapgs
     0.98%  t        [kernel.kallsyms]   [k] __switch_to
     0.96%  t        libpthread-2.13.so  [.] __lll_unlock_wake
     0.95%  t        [kernel.kallsyms]   [k] get_futex_key
     0.92%  t        libc-2.13.so        [.] __strlen_sse42
[...]

vs.

# Events: 15K cycles
#
# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  ......
#
    49.55%  :29571   7f1ed3830a86        [.] 7f1ed3830a86
     8.67%  t        libc-2.13.so        [.] __malloc
     8.03%  t        libc-2.13.so        [.] cfree
     4.88%  t        t                   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)
     4.67%  t        libc-2.13.so        [.] __strlen_sse42
     4.14%  t        libpthread-2.13.so  [.] __pthread_mutex_unlock
     3.62%  t        libpthread-2.13.so  [.] pthread_mutex_lock
     2.64%  t        t                   [.] f
     2.40%  t        libc-2.13.so        [.] _int_malloc
     2.10%  t        libc-2.13.so        [.] _int_free
     1.53%  t        libpthread-2.13.so  [.] __pthread_mutex_init_internal
     1.33%  t        t                   [.] operator new(unsigned long)
     1.08%  t        t                   [.] X::numpunct::do_grouping() const
     1.04%  t        libc-2.13.so        [.] __memcpy_sse2
Re: [PATCH] STDCXX-1073
The library patch seems correct, but the test case in the issue and the
new test aren't strictly conforming. The only requirement on
collate::transform() is that it return...

    A basic_string<charT> value that, compared lexicographically with the
    result of calling transform() on another string, yields the same
    result as calling do_compare() on the same two strings.

There's no requirement that embedded NULs be preserved (that's just how
it happens to be implemented). I find it best to avoid relying on
knowledge of implementation details when exercising the library, so that
tests don't start failing after a conforming optimization or some such
tweak is added to the code.

Attached is a test case that reproduces the bug without relying on
implementation details. The test passes with your patch, confirming that
it's good. This test still assumes that lexicographically distinct
strings compare unequal in the en_US locale, but I think that's a safe
assumption regardless of the encoding (though only for the hardcoded
strings).

Martin

On 10/18/2012 04:28 PM, Liviu Nicoara wrote:
> On 10/18/12 11:02, Martin Sebor wrote:
>> On 10/16/2012 10:38 AM, Liviu Nicoara wrote:
>>> I have enhanced the collation test, 22.locale.collate.cpp, with a
>>> bunch of cases that deal with embedded NULs, inside the input strings
>>> as well as at the edges. More defects became apparent, and they have
>>> been fixed. I have re-worked the src/collate.cpp patch and I am
>>> attaching it here. All collation tests (old and new) pass. If there
>>> are no objections, I will check it in later in the week.
>>
>> There are tabs in the patch (we need spaces). They also make the patch
>> harder to review than it otherwise would be. I would also suggest
>> keeping the MSVC 8.0 block and cleaning things up separately. That
>> will also make the patch easier to review. (I'm so swamped that this
>> matters to me.)
>
> I looked at both (latest) patches and they do not have tabs in them.
> They were sent in my last post on the 16th, the one you replied to.
>
> The diff without -w is quite verbose because I moved a whole block in
> both collate.cpp functions inside an if, so there are whitespace
> differences. svn diff -w gave me quite a decent view of the changes
> without having to look at the actual patched source file. I am
> re-attaching the patches, this time created without -w, so they are a
> bit verbose. No tabs this time either.
>
>> I haven't looked at the test patch yet. If you could clean up the tab
>> issue and reduce the amount of unnecessary whitespace changes, I
>> promise to review it this weekend.
>
> The test enhancements are pretty plain, just a bunch of added stuff.
>
> Thanks,
> Liviu

#include <cassert>
#include <cstdio>
#include <locale>

int main ()
{
    const std::locale loc ("en_US");

    typedef std::collate<char> C;
    const C &col = std::use_facet<C>(loc);

    static const struct {
        const char *a;
        unsigned    n1;
        const char *b;
        unsigned    n2;
    } tests[] = {
#define T(s, t) { s, sizeof s - 1, t, sizeof t - 1 }
        T ("", ""),
        T ("", "\0"),
        T ("", "\0\0"),
        T ("\0", ""),
        T ("\0", "\0"),
        T ("\0", "\0\0"),
        T ("a", "\0"),
        T ("a", "\0a"),
        T ("a", "a\0"),
        T ("a", "a\0\0"),
        T ("a\0", "a"),
        T ("a\0", "a\0"),
        T ("a\0", "a\0\0"),
        T ("\0a", ""),
        T ("\0a", "\0"),
        T ("\0a", "\0a"),
        T ("\0a", "\0a\0"),
        T ("a\0\0b", ""),
        T ("a\0\0b", "a"),
        T ("a\0\0b", "ab"),
        T ("a\0\0b", "a\0"),
        T ("a\0\0b", "a\0\0"),
        T ("a\0\0b", "a\0b"),
        T ("a\0\0b", "a\0\0b"),
    };

    for (unsigned i = 0; i != sizeof tests / sizeof *tests; ++i) {
        const char* const a  = tests [i].a;
        const unsigned    n1 = tests [i].n1;
        const char* const b  = tests [i].b;
        const unsigned    n2 = tests [i].n2;

        const std::string x1 = col.transform (a, a + n1);
        const std::string x2 = col.transform (b, b + n2);

        const int colcmp = col.compare (a, a + n1, b, b + n2);

        int lexcmp = x1.compare (x2);
        lexcmp = lexcmp < -1 ? -1 : 1 < lexcmp ? 1 : lexcmp;

        std::printf ("%u: %i == %i\n", i, lexcmp, colcmp);

        assert (colcmp == lexcmp);

        if (colcmp)
            assert (std::string (a, a + n1) != std::string (b, b + n2));
        else
            assert (std::string (a, a + n1) == std::string (b, b + n2));
    }
}
Re: [PATCH] STDCXX-1073
On 10/21/12 19:08, Martin Sebor wrote:
> There's no requirement that embedded NULs be preserved (that's just how
> it happens to be implemented). I find it best to avoid relying on
> knowledge of implementation details when exercising the library, so
> that tests don't start failing after a conforming optimization or some
> such tweak is added to the code.

That's right. I'll improve it. Thanks for the review.

Liviu
std::string reference counting performance
Hi all,

I ran the attached test case, s.cpp, and timed it:

* Deep copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

    16  0m39.292s   2m34.381s   0m0.035s
     8  0m20.573s   1m18.131s   0m0.025s
     6  0m15.037s   0m58.269s   0m0.009s
     4  0m9.973s    0m38.657s   0m0.009s
     2  0m8.750s    0m17.473s   0m0.016s
     1  0m8.499s    0m8.495s    0m0.002s

* Shallow copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

    16  5m29.311s   4m2.107s    17m39.634s
     8  2m43.635s   2m1.823s    8m50.454s
     6  2m10.601s   2m2.316s    6m18.462s
     4  1m20.766s   1m35.383s   3m29.467s
     2  0m25.146s   0m27.775s   0m22.408s
     1  0m3.557s    0m3.555s    0m0.000s

The results show counter-intuitively better performance for deep-copied
strings. Again, there is a reversal of performance for single-threaded
runs, where there is no contention.

Liviu

#include <iostream>
#include <locale>

#include <cstdio>
#include <cstdlib>
#include <cstring>

#include <pthread.h>
#include <unistd.h>

#define MAX_THREADS 128

static long nloops = 1000, nthreads = 16;
static bool volatile pwait = true;

extern "C" {

static void* f (void* pv)
{
    std::string const s = *reinterpret_cast<std::string*> (pv);

    volatile unsigned long n = 0;

    while (pwait) ;

    for (int i = 0; i < nloops; ++i) {
#if !defined (NO_DEEP_COPY)
        const std::string tmp (s.c_str ());
#else
        const std::string tmp = s;
#endif   // NO_DEEP_COPY

        n += strlen (tmp.c_str ());
    }

    return (void*)n;
}

}   // extern "C"

int main (int argc, char** argv)
{
    switch (argc) {
    case 3: nloops = atol (argv [2]);
    case 2: nthreads = atol (argv [1]);
        break;
    }

    pthread_t tid [MAX_THREADS] = { 0 };

    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;

    printf ("%ld, %ld\n", nthreads, nloops);

    pthread_setconcurrency (nthreads);

    std::string s ("Hello, world.");

    for (int i = 0; i < nthreads; ++i) {
        if (pthread_create (tid + i, 0, f, &s))
            exit (-1);
    }

    usleep (50);

    pwait = false;

    for (int i = 0; i < nthreads; ++i) {
        if (tid [i])
            pthread_join (tid [i], 0);
    }

    return 0;
}
Re: std::string reference counting performance
On 10/21/12 21:11, Liviu Nicoara wrote:
> Hi all,
>
> I ran the attached test case, s.cpp, and timed it:

Gosh darn it, attached the wrong files. Here are the right ones.

Liviu

* Deep-copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

    16  0m39.292s   2m34.381s   0m0.035s
     8  0m20.573s   1m18.131s   0m0.025s
     6  0m15.037s   0m58.269s   0m0.009s
     4  0m9.973s    0m38.657s   0m0.009s
     2  0m8.750s    0m17.473s   0m0.016s
     1  0m8.499s    0m8.495s    0m0.002s

* Ref-counted

    16  5m29.311s   4m2.107s    17m39.634s
     8  2m43.635s   2m1.823s    8m50.454s
     6  2m10.601s   2m2.316s    6m18.462s
     4  1m20.766s   1m35.383s   3m29.467s
     2  0m25.146s   0m27.775s   0m22.408s
     1  0m3.557s    0m3.555s    0m0.000s

#include <iostream>
#include <locale>

#include <cstdio>
#include <cstdlib>
#include <cstring>

#include <pthread.h>
#include <unistd.h>

#define MAX_THREADS 128

static long nloops = 1000, nthreads = 16;
static bool volatile pwait = true;

extern "C" {

static void* f (void* pv)
{
    std::string const s = *reinterpret_cast<std::string*> (pv);

    volatile unsigned long n = 0;

    while (pwait) ;

    for (int i = 0; i < nloops; ++i) {
#if !defined (NO_DEEP_COPY)
        const std::string tmp (s.c_str ());
#else
        const std::string tmp = s;
#endif   // NO_DEEP_COPY

        n += strlen (tmp.c_str ());
    }

    return (void*)n;
}

}   // extern "C"

int main (int argc, char** argv)
{
    switch (argc) {
    case 3: nloops = atol (argv [2]);
    case 2: nthreads = atol (argv [1]);
        break;
    }

    pthread_t tid [MAX_THREADS] = { 0 };

    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;

    printf ("%ld, %ld\n", nthreads, nloops);

    pthread_setconcurrency (nthreads);

    std::string s ("Hello, world.");

    for (int i = 0; i < nthreads; ++i) {
        if (pthread_create (tid + i, 0, f, &s))
            exit (-1);
    }

    usleep (50);

    pwait = false;

    for (int i = 0; i < nthreads; ++i) {
        if (tid [i])
            pthread_join (tid [i], 0);
    }

    return 0;
}
Re: std::string reference counting performance
On 10/21/12 21:44, Liviu Nicoara wrote:
> On 10/21/12 21:11, Liviu Nicoara wrote:
>> Hi all,
>>
>> I ran the attached test case, s.cpp, and timed it:
>
> Gosh darn it, attached the wrong files. Here are the right ones.

Made the same mistake twice. My apologies for the clutter. Here are the
right ones.

Liviu

* Deep-copy

$ for f in 16 8 4 2 1; do time ./s $f 5000; done

    16, 5000    0m9.438s    2m15.615s   0m10.799s
     8, 5000    0m6.973s    0m55.579s   0m0.006s
     4, 5000    0m6.942s    0m27.505s   0m0.001s
     2, 5000    0m6.888s    0m13.753s   0m0.002s
     1, 5000    0m6.800s    0m6.799s    0m0.001s

* Ref-counted

$ for f in 16 8 4 2 1; do time ./s $f 5000; done

    16, 5000    2m19.098s   2m36.278s   30m38.488s
     8, 5000    1m47.757s   2m7.278s    10m18.418s
     4, 5000    4m27.425s   5m56.620s   10m13.892s
     2, 5000    0m17.667s   0m22.560s   0m10.695s
     1, 5000    0m2.923s    0m2.921s    0m0.001s

#include <iostream>
#include <locale>

#include <cstdio>
#include <cstdlib>
#include <cstring>

#include <pthread.h>
#include <unistd.h>

#define MAX_THREADS 128

static long nloops = 1000, nthreads = 16;
static bool volatile pwait = true;

extern "C" {

static void* f (void* pv)
{
    std::string const s = *reinterpret_cast<std::string*> (pv);

    volatile unsigned long n = 0;

    while (pwait) ;

    for (int i = 0; i < nloops; ++i) {
#if !defined (NO_DEEP_COPY)
        const std::string tmp (s.c_str (), s.size ());
#else
        const std::string tmp = s;
#endif   // NO_DEEP_COPY

        n += strlen (tmp.c_str ());
    }

    return (void*)n;
}

}   // extern "C"

int main (int argc, char** argv)
{
    switch (argc) {
    case 3: nloops = atol (argv [2]);
    case 2: nthreads = atol (argv [1]);
        break;
    }

    pthread_t tid [MAX_THREADS] = { 0 };

    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;

    printf ("%ld, %ld\n", nthreads, nloops);

    pthread_setconcurrency (nthreads);

    std::string s ("Hello, world.");

    for (int i = 0; i < nthreads; ++i) {
        if (pthread_create (tid + i, 0, f, &s))
            exit (-1);
    }

    usleep (50);

    pwait = false;

    for (int i = 0; i < nthreads; ++i) {
        if (tid [i])
            pthread_join (tid [i], 0);
    }

    return 0;
}
Re: std::string reference counting performance
This is a known problem. The plan is to change the default string
implementation in thread-safe configurations in 5.0 from reference
counted to deep copying. Until then, it's a library configuration option
that must be explicitly enabled. There should be a Jira task to make
this change.

Martin

On 10/21/2012 07:11 PM, Liviu Nicoara wrote:
> Hi all,
>
> I ran the attached test case, s.cpp, and timed it:
>
> * Deep copy
>
> $ for f in 16 8 6 4 2 1; do time ./s $f 1; done
>
>     16  0m39.292s   2m34.381s   0m0.035s
>      8  0m20.573s   1m18.131s   0m0.025s
>      6  0m15.037s   0m58.269s   0m0.009s
>      4  0m9.973s    0m38.657s   0m0.009s
>      2  0m8.750s    0m17.473s   0m0.016s
>      1  0m8.499s    0m8.495s    0m0.002s
>
> * Shallow copy
>
> $ for f in 16 8 6 4 2 1; do time ./s $f 1; done
>
>     16  5m29.311s   4m2.107s    17m39.634s
>      8  2m43.635s   2m1.823s    8m50.454s
>      6  2m10.601s   2m2.316s    6m18.462s
>      4  1m20.766s   1m35.383s   3m29.467s
>      2  0m25.146s   0m27.775s   0m22.408s
>      1  0m3.557s    0m3.555s    0m0.000s
>
> The results show counter-intuitively better performance for deep-copied
> strings. Again, there is a reversal of performance for single-threaded
> runs, where there is no contention.
>
> Liviu