Re: Status
On 04/15/13 09:35, Jim Jagielski wrote: Just a quick Email... the list has been pretty quiet lately and just pinging to see who's still watching and reading :) Still here.
Re: [PATCH] STDCXX-1073
On 10/25/12 11:31, Martin Sebor wrote: There are comments suggesting that calling do_transform() on the whole string may be suboptimal. Intuitively it makes sense that calling wcscoll() (in a loop, on the NUL-terminated substrings, if necessary) should be faster than simply calling do_transform() followed by wstring::compare(), but it would make sense to confirm the hypothesis before implementing the optimization. Alright, same files, two more patches. 2012-10-28 Liviu Nicoara lnico...@apache.org Fixes to collate facet and test enhancements: * src/collate.cpp (__rw_strnxfrm): corrected processing of embedded NULs. (__rw_wcsnxfrm) same (duplicated code). (collate_bynamewchar_t::do_compare): fixed string comparison return values, re-implemented the wcscoll-based comparison. * tests/localization/22.locale.collate.cpp: implemented a simpler collation test for strings with embedded NULs. All existing and new collate tests pass, but I think some more test enhancements are required. Liviu Index: tests/localization/22.locale.collate.cpp === --- tests/localization/22.locale.collate.cpp(revision 1402988) +++ tests/localization/22.locale.collate.cpp(working copy) @@ -116,7 +116,7 @@ const char* widen (char* dst, const char return dst; } -#ifndef _RWSTD_NO_WCHAR_T +#if !defined (_RWSTD_NO_WCHAR_T) int c_strcoll (const wchar_t* s1, const wchar_t* s2) { @@ -1029,65 +1029,119 @@ check_hash_eff (const char* charTname) template class charT void -check_NUL_locale (const char* charTname, const char* locname) +check_NUL_collate (const char* charTname, const char* locname, + const charT* s1, size_t s1_len, + const charT* s2, size_t s2_len) { std::locale loc (locname); -charT s [STR_SIZE]; -gen_str (s, STR_SIZE); +typedef typename std::collatecharT Collate; +typedef typename Collate::string_type String; -charT buf [2][STR_SIZE]; +const Collate col = std::use_facetCollate (loc); -std::memcpy (buf [0], s, sizeof s); -std::memcpy (buf [1], s, sizeof s); +const String x1 = col.transform (s1, s1 + s1_len); +const String x2 = col.transform (s2, s2 + s2_len); -// -// Verify that first buffer compares more: -// |0| = buf [0] -// |0| = buf [1] -// -buf [0][4] = charT (); -buf [1][3] = charT (); +const int colcmp = col.compare (s1, s1 + s1_len, s2, s2 + s2_len); -typedef std::collatecharT Collate; +int lexcmp = x1.compare (x2); +lexcmp = lexcmp -1 ? -1 : 1 lexcmp ? 1 : lexcmp; + +rw_assert (colcmp == lexcmp, __FILE__, __LINE__, + collate%s::compare (%{*.*Ac}, %{*.*Ac}) = %d, + lexicographical comparison of transformed strings = %d, + mismatch in locale (\%s\), charTname, + sizeof (charT), s1_len, s1, + sizeof (charT), s2_len, s2, + colcmp, lexcmp, locname); + +const bool eq = +std::string (s1, s1 + s1_len) == +std::string (s2, s2 + s2_len); + +rw_assert (bool (colcmp) != eq, __FILE__, __LINE__, + collate%s::compare (%{*.*Ac}, %{*.*Ac}) = %d, + lexicographical compare = %s, mismatch in locale (\%s\), + charTname, + sizeof (charT), s1_len, s1, + sizeof (charT), s2_len, s2, colcmp, + (eq ? true : false), locname); +} -const Collate col = std::use_facetCollate (loc); +static void +check_NUL_collate (const char* charTname, const char* locname, char) +{ +#define T(s, t) \ +check_NUL_collate (charTname, locname, \ + s, sizeof s / sizeof *s - 1, \ + t, sizeof t / sizeof *t - 1) + +T (, ); +T (, \0); +T (, \0\0); +T (\0, ); +T (\0, \0); +T (\0, \0\0); +T (a, \0); +T (a, \0a); +T (a, a\0); +T (a, a\0\0); +T (a\0, a); +T (a\0, a\0); +T (a\0, a\0\0); +T (\0a, ); +T (\0a, \0); +T (\0a, \0a); +T (\0a, \0a\0); +T (a\0\0b, ); +T (a\0\0b, a); +T (a\0\0b, ab); +T (a\0\0b, a\0); +T (a\0\0b, a\0\0); +T (a\0\0b, a\0b); +T (a\0\0b, a\0\0b); +} -int cmp = col.compare ( -buf [0], buf [0] + sizeof buf [0] / sizeof *buf [0], -buf [1], buf [1] + sizeof buf [1] / sizeof *buf [1]); - -rw_assert (cmp 0, __FILE__, __LINE__, - collate%s::compare (%{*.*Ac}, %{*.*Ac}) - 0, failed in locale (\%s\), charTname, - sizeof (charT), sizeof buf [0] / sizeof *buf [0], buf [0], - sizeof (charT), sizeof buf [1] / sizeof *buf [1], buf [1], - locname); - -std::memcpy (buf [0], s, sizeof s); -std::memcpy (buf [1], s, sizeof s); - -// -// Verify that first compare less
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/26/12 18:50, Martin Sebor wrote: On 10/26/2012 06:50 AM, Liviu Nicoara wrote: [...] tl;dr: removing the facet data cache is a priority. All else can be put on the back-burner. [...] // facet data accessor ... if (0 == _C_impsize) { // 1 mutex_lock (); if (_C_impsize) return _C_data; _C_data = get_facet_data (); // 2 ?? // 3 _C_impsize = 1; // 4 mutex_unlock (); } ?? // 5 return _C_data; // 6 [...] Various compilers provide these features in various forms, but at the moment we don't have a unified STDCXX API to implement this. [...] I suggested moving the body of the outer if from the header into a .cpp file in the library where we could implement ugly, bloated locking without the risk of breaking things if we removed/replaced it in the future. That's when we ran into questions about how exactly to do this cleanly, etc. It didn't seem to be doable very cleanly but I still think it's a viable approach. Just making sure we are talking about the same thing. My argument (and the code oversimplification above) was about the initialization of the std::facet data, not the cached numpunct data. I think we should get rid of the numpunct cache right away because it makes the facet unusable in MT builds. As for the DCII I was talking about in the previous post, I would tackle it but I only have access to x86(_64). Does the foundation have shared dev machines? Liviu
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/03/12 11:10, Martin Sebor wrote: [...] I was just thinking of a few simple loops along the lines of: void* thread_func (void*) { for (int i = 0; i N; ++) test 1: do some simple stuff inline test 2: call a virtual function to do the same stuff test 3: lock and unlock a mutex and do the same stuff } Test 1 should be the fastest and test 3 the slowest. This should hold regardless of what simple stuff is (eventually, even when it's getting numpunct::grouping() data). tl;dr: removing the facet data cache is a priority. All else can be put on the back-burner. Conflicting test results aside, there still is the case of the incorrect handling of the cached data in the facet. I don't think there is a disagreement on that. Considering that the std::string is moving in the direction of dropping the handle-body implementation, simply getting rid of the cache is a step in the same direction. I think that we should preserve the lock-free reading of the facet data, as a benign race, but making it benign is perhaps more complicated than previously suggested. As a reminder, the core of the facet access and initialization code essentially looks like this (pseudocode-ish): // facet data accessor ... if (0 == _C_impsize) { // 1 mutex_lock (); if (_C_impsize) return _C_data; _C_data= get_facet_data (); // 2 ?? // 3 _C_impsize = 1; // 4 mutex_unlock (); } ?? // 5 return _C_data; // 6 ... with question marks for missing, necessary fixes. The compiler needs to be prevented from re-ordering both 2-4 and 1-6. Just for the sake of argument I can imagine an optimization that reorders the reads in 1-6: register x = _C_data; if (_C_impsize) return x; and if the loads are executed in this order, the caller will see a stale _C_data. First, the 2-4 writes need to be executed in the program order. This needs both a compiler barrier and a store-store memory barrier that will keep the writes ordered. Then, the reads in 1-6 need to be ordered such that _C_data is read after _C_impsize, via a compiler barrier and a load-load memory barrier that will preserve the program order of the loads. Various compilers provide these features in various forms, but at the moment we don't have a unified STDCXX API to implement this. Of course, I might be wrong. Input is appreciated. Thanks, Liviu
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/03/12 11:10, Martin Sebor wrote: [...] I was just thinking of a few simple loops along the lines of: tl;dr: I consider the results of the multi-threaded performance tests (12S, Intel/AMD multicores) as coming from heavy contention in copying of reference-counted std::string objects. The long version: I went back to the original test case (attached, t.cpp). As a reminder, the program is a simplification of the library numpunct facet creation process, which can be compiled either with data caching enabled or disabled. I ran the perf tool (https://perf.wiki.kernel.org/index.php/Main_Page) on the program, and I noticed marked differences in the number of instructions and cycles (an order of magnitude) and also in the number of CPU-migrations and context-switches (way larger for the cached version): Performance counter stats for './t': 121168.026301 task-clock-msecs # 3.778 CPUs 6133377 context-switches # 0.051 M/sec 35457 CPU-migrations # 0.000 M/sec 298 page-faults # 0.000 M/sec 338006437277 cycles # 2789.568 M/sec 136554855096 instructions # 0.404 IPC 31298346444 branches #258.305 M/sec 2305761092 branch-misses# 7.367 % 1592157186 cache-references # 13.140 M/sec 1729755 cache-misses # 0.014 M/sec 32.070381363 seconds time elapsed vs. Performance counter stats for './t': 16194.235282 task-clock-msecs # 3.883 CPUs 7491 context-switches # 0.000 M/sec 47 CPU-migrations # 0.000 M/sec 311 page-faults # 0.000 M/sec 45210281987 cycles # 2791.752 M/sec 76784904688 instructions # 1.698 IPC 18891711522 branches # 1166.570 M/sec 3162251552 branch-misses# 16.739 % 3332839 cache-references # 0.206 M/sec 119424 cache-misses # 0.007 M/sec 4.170904279 seconds time elapsed Then, I have run more detailed reports on the runs. I noticed that the report for the cached version contains what appears to be thread synchronization overhead, whereas the non-cached version doesn't: # Events: 134K cycles # # Overhead Command Shared Object Symbol # ... .. ... # 50.69%t [kernel.kallsyms] [k] _raw_spin_lock 5.46%t libpthread-2.13.so [.] pthread_mutex_lock 4.16%t libpthread-2.13.so [.] __pthread_mutex_unlock 3.65%t libpthread-2.13.so [.] __lll_lock_wait 3.35%t [kernel.kallsyms] [k] copy_user_generic_string 3.09%t 40395b [.] 40395b 2.91%t [kernel.kallsyms] [k] hash_futex 2.87%t [kernel.kallsyms] [k] futex_wake 2.67%t t [.] f 2.37%t [kernel.kallsyms] [k] futex_wait_setup 1.61%t [kernel.kallsyms] [k] system_call 1.00%t [kernel.kallsyms] [k] system_call_after_swapgs 0.98%t [kernel.kallsyms] [k] __switch_to 0.96%t libpthread-2.13.so [.] __lll_unlock_wake 0.95%t [kernel.kallsyms] [k] get_futex_key 0.92%t libc-2.13.so[.] __strlen_sse42 [...] vs. # Events: 15K cycles # # Overhead Command Shared Object Symbol # ... .. .. # 49.55% :295717f1ed3830a86 [.] 7f1ed3830a86 8.67%t libc-2.13.so[.] __malloc 8.03%t libc-2.13.so[.] cfree 4.88%t t [.] std::basic_stringchar, std::char_traitschar, std::allocatorchar ::basic_string(char const*, std::allocatorchar const) 4.67%t libc-2.13.so[.] __strlen_sse42 4.14%t libpthread-2.13.so [.] __pthread_mutex_unlock 3.62%t libpthread-2.13.so [.] pthread_mutex_lock 2.64%t t [.] f 2.40%t libc-2.13.so[.] _int_malloc 2.10%t libc-2.13.so[.] _int_free 1.53%t libpthread-2.13.so [.] __pthread_mutex_init_internal 1.33%t t [.] operator new(unsigned long) 1.08%t t [.] X::numpunct::do_grouping() const 1.04%t libc-2.13.so[.] __memcpy_sse2
Re: [PATCH] STDCXX-1073
On 10/21/12 19:08, Martin Sebor wrote: There's no requirement that embedded NULs must be preserved (that's just how it happens to be implemented). I find it best to avoid relying on the knowledge of implementation details when exercising the library so that tests don't start failing after a conforming optimization or some such tweak is added to the code. That's right. I'll improve it. Thanks for the review. Liviu
std::string reference counting performance
Hi all, I ran the attached test case, s.cpp, and timed it: * Deep copy $ for f in 16 8 6 4 2 1; do time ./s $f 1; done 16 0m39.292s 2m34.381s 0m0.035s 8 0m20.573s 1m18.131s 0m0.025s 6 0m15.037s 0m58.269s 0m0.009s 4 0m9.973s0m38.657s 0m0.009s 2 0m8.750s0m17.473s 0m0.016s 1 0m8.499s0m8.495s0m0.002s * Shallow copy $ for f in 16 8 6 4 2 1; do time ./s $f 1; done 16 5m29.311s 4m2.107s17m39.634s 8 2m43.635s 2m1.823s8m50.454s 6 2m10.601s 2m2.316s6m18.462s 4 1m20.766s 1m35.383s 3m29.467s 2 0m25.146s 0m27.775s 0m22.408s 1 0m3.557s0m3.555s0m0.000s The results show a counter-intuitive better performance for deep-copied strings. Again, there is the reversal of performance for one-threaded programs where there is no contention. Liviu #include iostream #include locale #include cstdio #include cstdlib #include cstring #include pthread.h #include unistd.h #define MAX_THREADS 128 static long nloops = 1000, nthreads = 16; static bool volatile pwait = true; extern C { static void* f (void* pv) { std::string const s = *reinterpret_cast std::string* (pv); volatile unsigned long n = 0; while (pwait) ; for (int i = 0; i nloops; ++i) { #if !defined (NO_DEEP_COPY) const std::string tmp (s.c_str ()); #else const std::string tmp = s; #endif // NO_DEEP_COPY n += strlen (tmp.c_str ()); } return (void*)n; } } // extern C int main (int argc, char** argv) { switch (argc) { case 3: nloops = atol (argv [2]); case 2: nthreads = atol (argv [1]); break; } pthread_t tid [MAX_THREADS] = { 0 }; if (nthreads MAX_THREADS) nthreads = MAX_THREADS; printf (%ld, %ld\n, nthreads, nloops); pthread_setconcurrency (nthreads); std::string s (Hello, world.); for (int i = 0; i nthreads; ++i) { if (pthread_create (tid + i, 0, f, s)) exit (-1); } usleep (50); pwait = false; for (int i = 0; i nthreads; ++i) { if (tid [i]) pthread_join (tid [i], 0); } return 0; } * Deep-copy $ for f in 16 8 6 4 2 1; do time ./s $f 1; done 16 0m39.292s 2m34.381s 0m0.035s 8 0m20.573s 1m18.131s 0m0.025s 6 0m15.037s 0m58.269s 0m0.009s 4 0m9.973s0m38.657s 0m0.009s 2 0m8.750s0m17.473s 0m0.016s 1 0m8.499s0m8.495s0m0.002s * Ref-counted 16 5m29.311s 4m2.107s17m39.634s 8 2m43.635s 2m1.823s8m50.454s 6 2m10.601s 2m2.316s6m18.462s 4 1m20.766s 1m35.383s 3m29.467s 2 0m25.146s 0m27.775s 0m22.408s 1 0m3.557s0m3.555s0m0.000s
Re: std::string reference counting performance
On 10/21/12 21:11, Liviu Nicoara wrote: Hi all, I ran the attached test case, s.cpp, and timed it: Gosh darn it, attached the wrong files. Here are the right ones. Liviu * Deep-copy $ for f in 16 8 6 4 2 1; do time ./s $f 1; done 16 0m39.292s 2m34.381s 0m0.035s 8 0m20.573s 1m18.131s 0m0.025s 6 0m15.037s 0m58.269s 0m0.009s 4 0m9.973s0m38.657s 0m0.009s 2 0m8.750s0m17.473s 0m0.016s 1 0m8.499s0m8.495s0m0.002s * Ref-counted 16 5m29.311s 4m2.107s17m39.634s 8 2m43.635s 2m1.823s8m50.454s 6 2m10.601s 2m2.316s6m18.462s 4 1m20.766s 1m35.383s 3m29.467s 2 0m25.146s 0m27.775s 0m22.408s 1 0m3.557s0m3.555s0m0.000s #include iostream #include locale #include cstdio #include cstdlib #include cstring #include pthread.h #include unistd.h #define MAX_THREADS 128 static long nloops = 1000, nthreads = 16; static bool volatile pwait = true; extern C { static void* f (void* pv) { std::string const s = *reinterpret_cast std::string* (pv); volatile unsigned long n = 0; while (pwait) ; for (int i = 0; i nloops; ++i) { #if !defined (NO_DEEP_COPY) const std::string tmp (s.c_str ()); #else const std::string tmp = s; #endif // NO_DEEP_COPY n += strlen (tmp.c_str ()); } return (void*)n; } } // extern C int main (int argc, char** argv) { switch (argc) { case 3: nloops = atol (argv [2]); case 2: nthreads = atol (argv [1]); break; } pthread_t tid [MAX_THREADS] = { 0 }; if (nthreads MAX_THREADS) nthreads = MAX_THREADS; printf (%ld, %ld\n, nthreads, nloops); pthread_setconcurrency (nthreads); std::string s (Hello, world.); for (int i = 0; i nthreads; ++i) { if (pthread_create (tid + i, 0, f, s)) exit (-1); } usleep (50); pwait = false; for (int i = 0; i nthreads; ++i) { if (tid [i]) pthread_join (tid [i], 0); } return 0; }
Re: std::string reference counting performance
On 10/21/12 21:44, Liviu Nicoara wrote: On 10/21/12 21:11, Liviu Nicoara wrote: Hi all, I ran the attached test case, s.cpp, and timed it: Gosh darn it, attached the wrong files. Here are the right ones. Made the same mistake twice. My apologies for the clutter. Here are the right ones. Liviu * Deep-copy $ for f in 16 8 4 2 1; do time ./s $f 5000; done 16, 50000m9.438s2m15.615s 0m10.799s 8, 50000m6.973s0m55.579s 0m0.006s 4, 50000m6.942s0m27.505s 0m0.001s 2, 50000m6.888s0m13.753s 0m0.002s 1, 50000m6.800s0m6.799s0m0.001s * Ref-counted $ for f in 16 8 4 2 1; do time ./s $f 5000; done 16, 50002m19.098s 2m36.278s 30m38.488s 8, 50001m47.757s 2m7.278s10m18.418s 4, 50004m27.425s 5m56.620s 10m13.892s 2, 50000m17.667s 0m22.560s 0m10.695s 1, 50000m2.923s0m2.921s0m0.001s #include iostream #include locale #include cstdio #include cstdlib #include cstring #include pthread.h #include unistd.h #define MAX_THREADS 128 static long nloops = 1000, nthreads = 16; static bool volatile pwait = true; extern C { static void* f (void* pv) { std::string const s = *reinterpret_cast std::string* (pv); volatile unsigned long n = 0; while (pwait) ; for (int i = 0; i nloops; ++i) { #if !defined (NO_DEEP_COPY) const std::string tmp (s.c_str (), s.size ()); #else const std::string tmp = s; #endif // NO_DEEP_COPY n += strlen (tmp.c_str ()); } return (void*)n; } } // extern C int main (int argc, char** argv) { switch (argc) { case 3: nloops = atol (argv [2]); case 2: nthreads = atol (argv [1]); break; } pthread_t tid [MAX_THREADS] = { 0 }; if (nthreads MAX_THREADS) nthreads = MAX_THREADS; printf (%ld, %ld\n, nthreads, nloops); pthread_setconcurrency (nthreads); std::string s (Hello, world.); for (int i = 0; i nthreads; ++i) { if (pthread_create (tid + i, 0, f, s)) exit (-1); } usleep (50); pwait = false; for (int i = 0; i nthreads; ++i) { if (tid [i]) pthread_join (tid [i], 0); } return 0; }
Re: Performance tests
On 10/18/12 10:49, Martin Sebor wrote: On 10/16/2012 10:45 AM, Liviu Nicoara wrote: Are there any performance tests (and measurements) of STDCXX features against similar features in libc? I don't think so. A long time ago I benchmarked stdcxx iostreams and stdio on a variety of operating systems but I'm not sure where those results are. They were interesting, though. On most systems, stdcxx iostreams was either on par or even faster than stdio. The one OS I remember not being able to beat was GLIBC. That is what I remember seeing a long time ago. I looked in my old mailboxes but I could not find it. Liviu
Re: [PATCH] STDCXX-1073
On 10/18/12 11:02, Martin Sebor wrote: On 10/16/2012 10:38 AM, Liviu Nicoara wrote: I have enhanced the collation test, 22.locale.collate.cpp with a bunch of cases that deal with embedded strings, inside the input strings as well as at the edges. More defects became apparent, and they have been fixed. I have re-worked the src/collate.cpp patch and I am attaching it here. All collation tests (old and new) pass. If there are no objections, I will check it in later in the week. There are tabs in the patch (we need spaces). They also make the patch harder to review than it otherwise would be. I would also suggest keeping the MSVC 8.0 block and cleaning things up separately. That will also make the patch easier to review. (I'm so swamped that this matters to me.) I looked at both (latest) patches and they do not have tabs in them. They were sent in my last post on the 16th, the one you replied to. The diff without -w is quite a bit verbose because I moved a whole block in both collate.cpp functions, inside an if, so there are whitespace differences. svn diff -w gave me a quite decent view of the changes without having to looks at the actual patched source file. I am re-attaching the patches, this time created without -w so they would be a bit verbose. No tabs this time either. I haven't looked at the test patch yet. If you could clean up the tab issue and reduce the amount of unnecessary whitespace changes I promise to review it this weekend. The test enhancements are pretty plain, just a bunch of added stuff. Thanks, Liviu Index: tests/localization/22.locale.collate.cpp === --- tests/localization/22.locale.collate.cpp(revision 1399886) +++ tests/localization/22.locale.collate.cpp(working copy) @@ -1029,7 +1029,105 @@ check_hash_eff (const char* charTname) template class charT void -check_NUL_locale (const char* charTname, const char* locname) +check_NUL_transform (const char* charTname, const char* locname, + const charT* src, size_t src_len, + const char* pat, size_t pat_len) +{ +std::locale loc (locname); + +typedef typename std::collatecharT Collate; +typedef typename Collate::string_type String; + +const Collate col = std::use_facetCollate (loc); + +String s = col.transform (src, src + src_len); + +for (size_t i = 0, j = 0; i pat_len; ++i) { + +if (j = s.size ()) { +rw_assert (false, __FILE__, __LINE__, + collate%s::transform (%{*.*Ac}) - %{*.*Ac} + , incomplete transformation (locale \%s\), + charTname, sizeof (charT), src_len, src, + sizeof (charT), s.size (), s.c_str (), locname); +break; +} + +switch (pat [i]) { +case 0: +rw_assert (j s.size () 0 == s [j], __FILE__, __LINE__, + collate%s::transform (\%{*.*Ac}\) - \%{*.*Ac}\ + , expected \\0, got %d in position %lu (locale \%s\), + charTname, sizeof (charT), src_len, src, sizeof (charT), + s.size (), s.c_str (), s [j], j, locname); +++j; +break; + +case '*': +rw_assert (j s.size () s [j], __FILE__, __LINE__, + collate%s::transform (\%{*.*Ac}\) - \%{*.*Ac}\ + , got NUL in position %lu (locale \%s\), charTname, + sizeof (charT), src_len, src, sizeof (charT), s.size (), + s.c_str (), j, locname); +for (; j s.size () s [j]; ++j) ; +break; + +default: +break; +} +} +} + +static void +check_NUL_transform_narrow (const char* charTname, const char* locname) +{ +#define T(src, pat) \ +check_NUL_transform (charTname, locname,\ + src, sizeof src / sizeof *src - 1, \ + pat, sizeof pat / sizeof *pat - 1) + +T (\0,\0 ); +T (\0\0, \0\0 ); +T (\0\0\0,\0\0\0 ); +T (a\0, *\0 ); +T (\0a\0, \0*\0); +T (\0a\0, \0*\0); +T (a\0\0, *\0\0); +T (a\0\0\0, *\0\0\0 ); +T (a\0b, *\0* ); +T (a\0b\0,*\0*\0 ); +T (a\0b\0\0, *\0*\0\0 ); +} + +static void +check_NUL_transform_wide (const char* charTname, const char* locname) +{ +T (L\0, \0 ); +T (L\0\0, \0\0 ); +T (L\0\0\0, \0\0\0 ); +T (La\0, *\0 ); +T (L\0a\0,\0*\0); +T (L\0a\0,\0*\0); +T (La\0\0,*\0\0); +T (La\0\0\0, *\0\0\0 ); +T (La\0b, *\0* ); +T (La\0b\0, *\0*\0 ); +T (La\0b\0\0, *\0*\0\0 ); + +#undef T +} + +template class charT +void +check_NUL_transform (const char
Re: [PATCH] STDCXX-1073
I have enhanced the collation test, 22.locale.collate.cpp with a bunch of cases that deal with embedded strings, inside the input strings as well as at the edges. More defects became apparent, and they have been fixed. I have re-worked the src/collate.cpp patch and I am attaching it here. All collation tests (old and new) pass. If there are no objections, I will check it in later in the week. Thanks, Liviu On 10/13/12 11:16, Liviu Nicoara wrote: The initial patch does not pass the following test case. Have re-worked the patch and attached it to the incident, and I am also attaching it here. It passes all collate tests. Index: tests/localization/22.locale.collate.cpp === --- tests/localization/22.locale.collate.cpp(revision 1397847) +++ tests/localization/22.locale.collate.cpp(working copy) @@ -1029,7 +1029,105 @@ check_hash_eff (const char* charTname) template class charT void -check_NUL_locale (const char* charTname, const char* locname) +check_NUL_transform (const char* charTname, const char* locname, + const charT* src, size_t src_len, + const char* pat, size_t pat_len) +{ +std::locale loc (locname); + +typedef typename std::collatecharT Collate; +typedef typename Collate::string_type String; + +const Collate col = std::use_facetCollate (loc); + +String s = col.transform (src, src + src_len); + +for (size_t i = 0, j = 0; i pat_len; ++i) { + +if (j = s.size ()) { +rw_assert (false, __FILE__, __LINE__, + collate%s::transform (%{*.*Ac}) - %{*.*Ac} + , incomplete transformation (locale \%s\), + charTname, sizeof (charT), src_len, src, + sizeof (charT), s.size (), s.c_str (), locname); +break; +} + +switch (pat [i]) { +case 0: +rw_assert (j s.size () 0 == s [j], __FILE__, __LINE__, + collate%s::transform (\%{*.*Ac}\) - \%{*.*Ac}\ + , expected \\0, got %d in position %lu (locale \%s\), + charTname, sizeof (charT), src_len, src, sizeof (charT), + s.size (), s.c_str (), s [j], j, locname); +++j; +break; + +case '*': +rw_assert (j s.size () s [j], __FILE__, __LINE__, + collate%s::transform (\%{*.*Ac}\) - \%{*.*Ac}\ + , got NUL in position %lu (locale \%s\), charTname, + sizeof (charT), src_len, src, sizeof (charT), s.size (), + s.c_str (), j, locname); +for (; j s.size () s [j]; ++j) ; +break; + +default: +break; +} +} +} + +static void +check_NUL_transform_narrow (const char* charTname, const char* locname) +{ +#define T(src, pat) \ +check_NUL_transform (charTname, locname,\ + src, sizeof src / sizeof *src - 1, \ + pat, sizeof pat / sizeof *pat - 1) + +T (\0,\0 ); +T (\0\0, \0\0 ); +T (\0\0\0,\0\0\0 ); +T (a\0, *\0 ); +T (\0a\0, \0*\0); +T (\0a\0, \0*\0); +T (a\0\0, *\0\0); +T (a\0\0\0, *\0\0\0 ); +T (a\0b, *\0* ); +T (a\0b\0,*\0*\0 ); +T (a\0b\0\0, *\0*\0\0 ); +} + +static void +check_NUL_transform_wide (const char* charTname, const char* locname) +{ +T (L\0, \0 ); +T (L\0\0, \0\0 ); +T (L\0\0\0, \0\0\0 ); +T (La\0, *\0 ); +T (L\0a\0,\0*\0); +T (L\0a\0,\0*\0); +T (La\0\0,*\0\0); +T (La\0\0\0, *\0\0\0 ); +T (La\0b, *\0* ); +T (La\0b\0, *\0*\0 ); +T (La\0b\0\0, *\0*\0\0 ); + +#undef T +} + +template class charT +void +check_NUL_transform (const char* charTname, const char* locname) +{ +check_NUL_transform_narrow (charTname, locname); +check_NUL_transform_wide (charTname, locname); +} + +template class charT +void +check_NUL_collate (const char* charTname, const char* locname) { std::locale loc (locname); @@ -1090,6 +1188,14 @@ check_NUL_locale (const char* charTname, template class charT void +check_NUL (const char* charTname, const char* locname) +{ +check_NUL_transform charT (charTname, locname); +check_NUL_collate charT(charTname, locname); +} + +template class charT +void check_NUL (const char* charTname) { // Verify that the collate facet correctly handles character @@ -1103,7 +1209,7 @@ check_NUL (const char* charTname) for (const char* locname = rw_locales (LC_COLLATE); *locname; locname += std::strlen (locname) + 1, ++i) { try { -check_NUL_localecharT (charTname, locname
Performance tests
Are there any performance tests (and measurements) of STDCXX features against similar features in libc? Thanks, Liviu
Re: [PATCH] STDCXX-970
On 10/10/12 07:40, Liviu Nicoara wrote: The following patch cleans up most of the failures seen in the collation test. It does not fix the transform failures and libstd tests. 2012-10-10 Liviu Nicoara lnico...@apache.org Various fixes: * tests/localization/22.locale.collate.cpp: removed unused macros, corrected assertions and their formatting. (c_xfrm): corrected use of safety buffers. (make_test_locale): used rw_create_locale, used the whole portable character set. (check_libc): split code, restored previous locales after test. (check_NUL): hard-coded positions of NUL characters. Committed revision 1397847.
Re: [PATCH] STDCXX-1073
The initial patch does not pass the following test case. Have re-worked the patch and attached it to the incident, and I am also attaching it here. It passes all collate tests. Here is the second test case: $ cat ../../tests/localization/t.cpp; nice make t.cpp ./t af_ZA.utf8; echo $? #include iostream #include locale #include string int main (int argc, char** argv) { std::locale loc (argv [1]); const std::collate char fac = std::use_facet std::collate char (loc); char const buf [] = a\0\0b; std::string s = fac.transform (buf, buf + sizeof buf - 1); size_t i = 0; for (; i s.size () 0 == s [i]; ++i) ; return !(i == 2); } 1 On 10/10/12 08:25, Liviu Nicoara wrote: 2012-10-10 Liviu Nicoara lnico...@apache.org * src/collate.cpp (__rw_strnxfrm): preserved embedded NULs Index: src/collate.cpp === --- src/collate.cpp (revision 1397825) +++ src/collate.cpp (working copy) @@ -480,113 +480,103 @@ { _STD::string res; -char buf [256]; -char *pbuf = buf; - +char buf [256], *pbuf = buf, *psrc = buf; size_t bufsize = sizeof buf; -char *psrc = buf; while (nchars) { -// using a C-style cast instead of static_cast to avoid -// a gcc 2.95.2 bug causing an error on some platforms: -// static_cast from `void *' to `const char *' -const char* const last = (const char*)memchr (src, '\0', nchars); - -if (0 == last) { - -// no NUL found in the initial portion of the source string -// that fits into the local temporary buffer; copy as many -// characters as fit into the buffer +if (src [0]) { -if (bufsize = nchars) { -if (pbuf != buf) -delete[] pbuf; -pbuf = new char [nchars + 1]; +// using a C-style cast instead of static_cast to avoid +// a gcc 2.95.2 bug causing an error on some platforms: +// static_cast from `void *' to `const char *' +const char* const last = (const char*)memchr (src, '\0', nchars); + +if (0 == last) { +// no NUL found in the initial portion of the source string +// that fits into the local temporary buffer; copy as many +// characters as fit into the buffer + +if (bufsize = nchars) { +if (pbuf != buf) +delete[] pbuf; +pbuf = new char [nchars + 1]; +} + +psrc = pbuf; +memcpy (psrc, src, nchars); + +// append a terminating NUL and decrement the number +// of characters that remain to be processed +psrc [nchars] = '\0'; +src += nchars; +nchars= 0; +} +else { +// terminating NUL found in the source buffer +nchars -= (last - src) + 1; +psrc= _RWSTD_CONST_CAST (char*, src); +src+= (last - src) + 1; } -psrc = pbuf; -memcpy (psrc, src, nchars); +#ifdef _RWSTD_OS_SUNOS +// Solaris 10u5 on AMD64 overwrites memory past the end of +// just_in_case_buf[8], to avoid this, pass a null pointer +char* const just_in_case_buf = 0; +#else +// provide a destination buffer to strxfrm() in case +// it's buggy (such as MSVC's) and tries to write to +// the buffer even if it's 0 +char just_in_case_buf [8]; +#endif // _RWSTD_OS_SUNOS -// append a terminating NUL and decrement the number -// of characters that remain to be processed -psrc [nchars] = '\0'; -src += nchars; -nchars= 0; -} -else { +const size_t dst_size = strxfrm (just_in_case_buf, psrc, 0); -// terminating NUL found in the source buffer -nchars -= (last - src) + 1; -psrc= _RWSTD_CONST_CAST (char*, src); -src+= (last - src) + 1; -} +// check for strxfrm() errors +if (0 == (dst_size 1)) { +if (pbuf != buf) +delete[] pbuf; -#ifdef _RWSTD_OS_SUNOS -// Solaris 10u5 on AMD64 overwrites memory past the end of -// just_in_case_buf[8], to avoid this, pass a null pointer -char* const just_in_case_buf = 0; -#else -// provide a destination buffer to strxfrm() in case -// it's buggy (such as MSVC's) and tries to write to -// the buffer even if it's 0 -char just_in_case_buf [8]; -#endif - -const size_t dst_size = strxfrm (just_in_case_buf, psrc, 0); - -// check for strxfrm() errors -if (0 == (dst_size 1
Re: [PATCH] Re: STDCXX-1072 SPARC V8 mutex alignment requirements
Thanks, Travis! On 10/13/12 15:36, Travis Vitek wrote: Liviu, I built the library and tests before and after your change with the 64-bit flag, and I saw no differences in the number of failed tests between the builds. I've attached the output of 'gmake -k runall' before and after your change to STDCXX-1072 in case you want to look over them. I wrote up the following test case to prove that I could reproduce the issue using the compile/link options used to build the stdcxx tests. The code is below.. [vitek@andromeda] 138 % cat t.cpp #include thread.h #include synch.h #include malloc.h int main () { int *ip; mutex_t *mp; ip = (int*)malloc(sizeof (int) + sizeof (mutex_t)); mp = (mutex_t*)(ip + 1); mutex_init(mp, USYNC_THREAD | LOCK_ROBUST, 0); mutex_lock(mp); mutex_unlock(mp); mutex_destroy(mp); free(ip); } [vitek@andromeda] 139 % gmake t CC -c -D_RWSTDDEBUG -mt -I/amd/homes/vitek/tmp/stdcxx-4.2.x/include \ -I/build/vitek/stdcxx-4.2.x_sunpro-5.8_sunos-5.10-patched/include \ -I/amd/homes/vitek/tmp/stdcxx-4.2.x/tests/include \ -library=%none -g -m64 +w -errtags -erroff=hidef t.cpp CC t.o -o t -L/build/vitek/stdcxx-4.2.x_sunpro-5.8_sunos-5.10-patched/rwtest \ -lrwtest15S -library=%none -mt -m64 \ -L/build/vitek/stdcxx-4.2.x_sunpro-5.8_sunos-5.10-patched/lib \ -lstd15S -lm Bus Error (core dumped) [vitek@andromeda] 140 % I also tested with POSIX mutexes and saw the same behavior. Travis From: Liviu Nicoara Sent: Thursday, October 11, 2012 5:28 AM To: dev@stdcxx.apache.org Subject: Re: [PATCH] Re: STDCXX-1072 SPARC V8 mutex alignment requirements I applied the patch on 4.2.x. If someone with access to a SPARC machine could give it a runall and post the results here it would be awesome. I will postpone closing the incident until then. Thanks! Liviu
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/04/12 22:41, Liviu Nicoara wrote: On 10/4/12 10:10 PM, Liviu Nicoara wrote: On 10/3/12 11:10 AM, Martin Sebor wrote: On 10/03/2012 07:01 AM, Liviu Nicoara wrote: void* thread_func (void*) { for (int i = 0; i N; ++) test 1: do some simple stuff inline test 2: call a virtual function to do the same stuff test 3: lock and unlock a mutex and do the same stuff } Test 1 should be the fastest and test 3 the slowest. This should hold regardless of what simple stuff is (eventually, even when it's getting numpunct::grouping() data). That is expected; I attached test case x.cpp and results-x.txt. I did not find it too interesting in its own, though. The difference between the cached and non-cached data is that in the case of the cached data the copying of the string involves nothing more than a bump in the reference counter, whereas in the non-cached version a string object is constructed anew, and memory gets allocated for its body. Yet, in my measurements, the cached version is the one which shows the worse performance. So, I extracted the std::string class and simplified it down and put it in another test case. That would be u.cpp and the results are results-u.txt. The results show the same performance trends although the absolute values have skewed. Will get back to this after I digest the results a bit more. This discussion has stagnated lately. I hope to allocate some time this week-end to go over the results, re-verify them and analyze the disassembly. In the meantime, any thoughts? Liviu
Re: [PATCH] Re: STDCXX-1072 SPARC V8 mutex alignment requirements
I applied the patch on 4.2.x. If someone with access to a SPARC machine could give it a runall and post the results here it would be awesome. I will postpone closing the incident until then. Thanks! Liviu On 10/06/12 16:56, Liviu Nicoara wrote: On 09/29/12 15:33, Liviu Nicoara wrote: On 9/28/12 11:32 AM, Travis Vitek wrote: [...] I think an even cleaner solution is to switch to using __rw_aligned_buffer instead. It gives us a single point of failure for alignment issues like this, and it makes the code self-documenting and easier to read. I am attaching another patch here, which makes use of the __rw_aligned_buffer, slightly more verbose but the code is slightly cleaner. If there are no objections, I would check it in sometime after this week-end. Liviu
[PATCH] STDCXX-970
The following patch cleans up most of the failures seen in the collation test. It does not fix the transform failures and libstd tests. 2012-10-10 Liviu Nicoara lnico...@apache.org Various fixes: * tests/localization/22.locale.collate.cpp: removed unused macros, corrected assertions and their formatting. (c_xfrm): corrected use of safety buffers. (make_test_locale): used rw_create_locale, used the whole portable character set. (check_libc): split code, restored previous locales after test. (check_NUL): hard-coded positions of NUL characters. Index: tests/localization/22.locale.collate.cpp === --- tests/localization/22.locale.collate.cpp(revision 1392832) +++ tests/localization/22.locale.collate.cpp(working copy) @@ -43,6 +43,9 @@ #include rw_locale.h #include rw_process.h +#define IGNORE0 +#define STR_SIZE 16 +#define LOCNAME_SIZE 256 #if _RWSTD_PATH_SEP == '/' # define SLASH / @@ -59,10 +62,6 @@ #define LOCALE_ROOT RWSTD_LOCALE_ROOT const char* locale_root; -#define LC_COLLATE_SRC LC_COLLATE.src -#define LC_COLLATE_CM LC_COLLATE.cm -#define TEST_LOCALE_NAMEtest.locale - /**/ // These overloads are necessary in our template @@ -77,18 +76,18 @@ std::size_t c_xfrm (char* to, const char* from, std::size_t size) { -char safety_buf [8]; +char safety_buf [8] = { 0 }; + if (0 == to 0 == size) { // prevent buggy implementations (such as MSVC 8) from trying // to write to the destination buffer even though it's 0 and // its size is zero (see stdcxx-69) to = safety_buf; -*to = '\0'; } std::size_t n = std::strxfrm (to, from, size); -if (to) +if (to to != safety_buf) n = std::strlen (to); return n; @@ -127,45 +126,42 @@ std::size_t c_xfrm (wchar_t* to, const wchar_t* from, std::size_t size) { +std::size_t n = 0; + #if !defined (_MSC_VER) || _MSC_VER 1200 -wchar_t safety_buf [8]; +wchar_t safety_buf [8] = { 0 }; + if (0 == to 0 == size) { // prevent buggy implementations (such as MSVC 8) from trying // to write to the destination buffer even though it's 0 and // its size is zero (see stdcxx-69) to = safety_buf; -*to = L'\0'; } -std::size_t n = std::wcsxfrm (to, from, size); +n = std::wcsxfrm (to, from, size); -if (to) +if (to to != safety_buf) n = std::wcslen (to); #else // MSVC 6 and prior // working around an MSVC 6.0 libc bug (PR #26437) - if (to) { -std::size_t n = std::wcsxfrm (to, from, size); - +std::wcsxfrm (to, from, size); n = std::wcslen (to); - -return n; } +else { +wchar_t tmp [1024]; -wchar_t tmp [1024]; - -std::size_t n = std::wcslen (from); - -_RWSTD_ASSERT (n sizeof tmp / sizeof *tmp); - -std::wcscpy (tmp, from); - -std::wcsxfrm (tmp, from, sizeof tmp / sizeof *tmp); - -n = std::wcslen (tmp); +n = std::wcslen (from); +_RWSTD_ASSERT (n sizeof tmp / sizeof *tmp); + +std::wcscpy (tmp, from); +std::wcsxfrm (tmp, from, sizeof tmp / sizeof *tmp); + +n = std::wcslen (tmp); +} #endif // MSVC 6 @@ -226,7 +222,7 @@ /**/ template class charT -/*static*/ void +void gen_str (charT* str, std::size_t size) { // generate a random string with the given size @@ -234,7 +230,7 @@ return; // use ASCII characters in the printable range -for (std::size_t i = 0; i != size - 1; ++i) +for (std::size_t i = 0; i size - 1; ++i) str [i] = ' ' + std::rand () % ('~' - ' '); str [size - 1] = charT (); @@ -243,152 +239,165 @@ /**/ template class charT -/*static*/ void -check_libc (const char* charTname) +void +check_libc_locale (const char* charTname, char const* locname, + int (nfail) [3]) { -// the libc implementation of the library should act the same as -// the c-library. Go through all the locales, generate some random -// strings and make sure that the following holds true: -// transform acts like strxfrm and wcsxfrm, -// compare acts like strcoll and wcscoll +typedef std::char_traitscharT traits_type; +typedef std::allocatorcharT allocator_type; +typedef std::basic_string charT, traits_type, allocator_type string_type; -int nfail [3] = { 0 }; +std::locale loc (locname); -rw_info (0, __FILE__, __LINE__, - libc std::collate%s::transform (), charTname); +const std::collatecharT co = +_STD_USE_FACET (std::collatecharT, loc
[PATCH] STDCXX-1073
2012-10-10 Liviu Nicoara lnico...@apache.org * src/collate.cpp (__rw_strnxfrm): preserved embedded NULs Index: src/collate.cpp === --- src/collate.cpp (revision 1392832) +++ src/collate.cpp (working copy) @@ -547,7 +547,7 @@ _TRY { // resize the result string to fit itself plus the result -// of the transformation including the terminatin NUL +// of the transformation including the terminating NUL // appended by strxfrm() res.resize (res_size + dst_size + 1); } @@ -557,36 +557,15 @@ _RETHROW; } -// transfor the source string up to the terminating NUL -size_t xfrm_size = -strxfrm (res [0] + res_size, psrc, dst_size + 1); - -#if defined _MSC_VER _MSC_VER 1400 -// compute the correct value that should have been returned from -// strxfrm() after the transformation has completed (MSVC strxfrm() -// returns a bogus result; see PR #29935) -xfrm_size = strlen (res [0] + res_size); -#endif // MSVC 8.0 - -// increment the size of the result string by the number -// of transformed characters excluding the terminating NUL -// if strxfrm() transforms the empty string into the empty -// string, keep the terminating NUL, otherwise drop it -res_size += xfrm_size + (last !*psrc !xfrm_size); - -_TRY { -res.resize (res_size); -} -_CATCH (...) { -if (pbuf != buf) -delete[] pbuf; -_RETHROW; -} +strxfrm (res [0] + res_size, psrc, dst_size + 1); } if (pbuf != buf) delete[] pbuf; +if (!res.empty ()) +res.resize (res.size () - 1); + return res; }
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/3/12 11:10 AM, Martin Sebor wrote: On 10/03/2012 07:01 AM, Liviu Nicoara wrote: I am gathering some more measurements along these lines but it's time consuming. I estimate I will have some ready for review later today or tomorrow. In the meantime could you please post your kernel, glibc and compiler versions? I was just thinking of a few simple loops along the lines of: void* thread_func (void*) { for (int i = 0; i N; ++) test 1: do some simple stuff inline test 2: call a virtual function to do the same stuff test 3: lock and unlock a mutex and do the same stuff } Test 1 should be the fastest and test 3 the slowest. This should hold regardless of what simple stuff is (eventually, even when it's getting numpunct::grouping() data). That is expected; I attached test case x.cpp and results-x.txt. I did not find it too interesting in its own, though. The difference between the cached and non-cached data is that in the case of the cached data the copying of the string involves nothing more than a bump in the reference counter, whereas in the non-cached version a string object is constructed anew, and memory gets allocated for its body. Yet, in my measurements, the cached version is the one which shows the worse performance. So, I extracted the std::string class and simplified it down and put it in another test case. That would be u.cpp and the results are results-u.txt. The results show the same performance trends although the absolute values have skewed. Will get back to this after I digest the results a bit more. Liviu -*- mode: org -*- * iMac, 4x Core i5 , 12S, gcc 4.5.4: $ nice make u 16, 1 1m18.811s 5m12.329s 0m0.263s 8, 10m39.919s 2m36.198s 0m0.150s 4, 10m20.449s 1m20.797s 0m0.050s 2, 10m9.888s0m19.725s 0m0.005s 1, 10m2.483s0m2.480s0m0.002s $ nice make CPPOPTS=-DNO_CACHE u 16, 1 0m37.418s 2m27.822s 0m0.872s 8, 10m18.844s 1m14.607s 0m0.261s 4, 10m10.165s 0m40.147s 0m0.023s 2, 10m8.652s0m17.278s 0m0.003s 1, 10m8.482s0m8.473s0m0.007s $ nice make CPPOPTS=-DNO_VIRTUAL_CALL u 16, 1 1m2.770s4m9.307s0m0.179s 8, 10m31.890s 2m6.792s0m0.087s 4, 10m16.427s 1m5.133s0m0.039s 2, 10m8.497s0m16.981s 0m0.007s 1, 10m2.291s0m2.288s0m0.002s $ nice make CPPOPTS=-DNO_CACHE -DNO_VIRTUAL_CALL u 16, 1 0m35.838s 2m21.406s 0m0.877s 8, 10m19.007s 1m14.920s 0m0.255s 4, 10m10.099s 0m39.504s 0m0.042s 2, 10m8.599s0m17.190s 0m0.003s 1, 10m8.986s0m8.980s0m0.005s * Linux Slackware, 16x AMD Opteron, 12S, gcc 4.5.2 $ nice make u $ nice make CPPOPTS=-DNO_CACHE u $ nice make CPPOPTS=-DNO_VIRTUAL_CALL u $ nice make CPPOPTS=-DNO_CACHE -DNO_VIRTUAL_CALL u -*- mode: org -*- * iMac, 4x Core i5 , 12S, gcc 4.5.4: $ nice make CPPOPTS=-DNO_LOCK -DNO_VIRTUAL_CALL u 16, 1 0m7.864s0m30.259s 0m0.035s 8, 1 0m4.396s0m17.034s 0m0.016s 4, 1 0m2.729s0m10.473s 0m0.011s 2, 1 0m2.481s0m4.929s0m0.003s 1, 1 0m2.461s0m2.455s0m0.002s $ nice make CPPOPTS=-DNO_LOCK u 16, 1 0m9.724s0m37.455s 0m0.043s 8, 1 0m5.559s0m20.309s 0m0.048s 4, 1 0m3.160s0m12.213s 0m0.013s 2, 1 0m2.872s0m5.694s0m0.004s 1, 1 0m2.845s0m2.838s0m0.002s $ nice make u 16, 1 1m3.745s3m58.570s 0m0.351s 8, 1 0m32.351s 1m55.740s 0m0.203s 4, 1 0m16.852s 1m1.633s0m0.092s 2, 1 0m8.419s0m16.699s 0m0.010s 1, 1 0m4.214s0m4.179s0m0.005s * Linux Slackware, 16x AMD Opteron, 12S, gcc 4.7.1 $ nice make CPPOPTS=-DNO_LOCK -DNO_VIRTUAL_CALL u 16, 1 0m4.382s1m9.896s0m0.004s 8, 1 0m4.374s0m34.904s 0m0.002s 4, 1 0m4.368s0m17.445s 0m0.002s 2, 1 0m4.366s0m8.720s0m0.003s 1, 1 0m4.355s0m4.351s0m0.001s $ nice make CPPOPTS=-DNO_LOCK u 16, 1 0m5.415s1m19.833s 0m0.005s 8, 1 0m4.939s0m39.438s 0m0.003s 4, 1 0m4.936s0m19.712s 0m0.001s 2, 1 0m4.930s0m9.847s0m0.002s 1, 1 0m4.921s0m4.917s0m0.002s $ nice make u 16, 1 1m40.769s 24m17.198s 0m0.006s 8, 1 0m51.702s 6m15.400s
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/4/12 10:10 PM, Liviu Nicoara wrote: On 10/3/12 11:10 AM, Martin Sebor wrote: On 10/03/2012 07:01 AM, Liviu Nicoara wrote: I am gathering some more measurements along these lines but it's time consuming. I estimate I will have some ready for review later today or tomorrow. In the meantime could you please post your kernel, glibc and compiler versions? I was just thinking of a few simple loops along the lines of: void* thread_func (void*) { for (int i = 0; i N; ++) test 1: do some simple stuff inline test 2: call a virtual function to do the same stuff test 3: lock and unlock a mutex and do the same stuff } Test 1 should be the fastest and test 3 the slowest. This should hold regardless of what simple stuff is (eventually, even when it's getting numpunct::grouping() data). That is expected; I attached test case x.cpp and results-x.txt. I did not find it too interesting in its own, though. The difference between the cached and non-cached data is that in the case of the cached data the copying of the string involves nothing more than a bump in the reference counter, whereas in the non-cached version a string object is constructed anew, and memory gets allocated for its body. Yet, in my measurements, the cached version is the one which shows the worse performance. So, I extracted the std::string class and simplified it down and put it in another test case. That would be u.cpp and the results are results-u.txt. The results show the same performance trends although the absolute values have skewed. Will get back to this after I digest the results a bit more. Attached some incomplete files previously. Corrected. Liviu -*- mode: org -*- * iMac, 4x Core i5 , 12S, gcc 4.5.4: $ nice make u 16, 1 1m18.811s 5m12.329s 0m0.263s 8, 10m39.919s 2m36.198s 0m0.150s 4, 10m20.449s 1m20.797s 0m0.050s 2, 10m9.888s0m19.725s 0m0.005s 1, 10m2.483s0m2.480s0m0.002s $ nice make CPPOPTS=-DNO_CACHE u 16, 1 0m37.418s 2m27.822s 0m0.872s 8, 10m18.844s 1m14.607s 0m0.261s 4, 10m10.165s 0m40.147s 0m0.023s 2, 10m8.652s0m17.278s 0m0.003s 1, 10m8.482s0m8.473s0m0.007s $ nice make CPPOPTS=-DNO_VIRTUAL_CALL u 16, 1 1m2.770s4m9.307s0m0.179s 8, 10m31.890s 2m6.792s0m0.087s 4, 10m16.427s 1m5.133s0m0.039s 2, 10m8.497s0m16.981s 0m0.007s 1, 10m2.291s0m2.288s0m0.002s $ nice make CPPOPTS=-DNO_CACHE -DNO_VIRTUAL_CALL u 16, 1 0m35.838s 2m21.406s 0m0.877s 8, 10m19.007s 1m14.920s 0m0.255s 4, 10m10.099s 0m39.504s 0m0.042s 2, 10m8.599s0m17.190s 0m0.003s 1, 10m8.986s0m8.980s0m0.005s * Linux Slackware, 16x AMD Opteron, 12S, gcc 4.5.2 $ nice make u 16, 1 2m7.644s30m10.014s 0m0.006s 8, 1 1m4.449s7m45.111s 0m0.004s 4, 1 0m31.646s 2m1.438s0m0.002s 2, 1 0m14.828s 0m29.637s 0m0.002s 1, 1 0m3.223s0m3.220s0m0.002s $ nice make CPPOPTS=-DNO_CACHE u 16, 1 0m10.730s 2m46.032s 0m2.204s 8, 1 0m9.922s1m18.892s 0m0.008s 4, 1 0m9.874s0m39.192s 0m0.003s 2, 1 0m9.652s0m19.268s 0m0.004s 1, 1 0m9.622s0m9.617s0m0.002s $ nice make CPPOPTS=-DNO_VIRTUAL_CALL u 16, 1 1m54.812s 27m5.957s 0m0.007s 8, 1 0m53.862s 6m26.391s 0m0.003s 4, 1 0m26.287s 1m39.991s 0m0.001s 2, 1 0m9.120s0m17.233s 0m0.002s 1, 1 0m3.004s0m3.002s0m0.001s $ nice make CPPOPTS=-DNO_CACHE -DNO_VIRTUAL_CALL u 16, 1 0m9.717s2m34.412s 0m0.401s 8, 1 0m9.613s1m16.762s 0m0.007s 4, 1 0m9.590s0m38.285s 0m0.004s 2, 1 0m9.629s0m19.165s 0m0.002s 1, 1 0m9.361s0m9.357s0m0.001s -*- mode: org -*- * iMac, 4x Core i5 , 12S, gcc 4.5.4: $ nice make CPPOPTS=-DNO_LOCK -DNO_VIRTUAL_CALL u 16, 1 0m7.864s0m30.259s 0m0.035s 8, 1 0m4.396s0m17.034s 0m0.016s 4, 1 0m2.729s0m10.473s 0m0.011s 2, 1 0m2.481s0m4.929s0m0.003s 1, 1 0m2.461s0m2.455s0m0.002s $ nice make CPPOPTS=-DNO_LOCK u 16, 1 0m9.724s0m37.455s 0m0.043s 8, 1 0m5.559s0m20.309s 0m0.048s 4, 1 0m3.160s0m12.213s 0m0.013s 2
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 10/02/12 10:41, Martin Sebor wrote: I haven't had time to look at this since my last email on Sunday. I also forgot about the string mutex. I don't think I'll have time to spend on this until later in the week. Unless the disassembly reveals the smoking gun, I think we might need to simplify the test to get to the bottom of the differences in our measurements. (I.e., eliminate the library and measure the runtime of a simple thread loop, with and without locking.) We should also look at the GLIBC and kernel versions on our systems, on the off chance that there has been a change that could explain the discrepancy between my numbers and yours. I suspect my system (RHEL 4.8) is much older than yours (I don't remember now if you posted your details). I am gathering some more measurements along these lines but it's time consuming. I estimate I will have some ready for review later today or tomorrow. In the meantime could you please post your kernel, glibc and compiler versions? Liviu Martin On 10/02/2012 06:22 AM, Liviu Nicoara wrote: On 09/30/12 18:18, Martin Sebor wrote: I see you did a 64-bit build while I did a 32-bit one. so I tried 64-bits. The cached version (i.e., the one compiled with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE). I had made one change to the test program that I thought might account for the difference: I removed the call to abort from the thread function since it was causing the process to exit prematurely in some of my tests. But since you used the modified program for your latest measurements that couldn't be it. I can't explain the differences. They just don't make sense to me. Your results should be the other way around. Can you post the disassembly of function f() for each of the two configurations of the test? The first thing that struck me in the cached `f' was that __string_ref class uses a mutex for synchronizing access to the ref counter. It turns out, for Linux on AMD64 we explicitly use a mutex instead of the atomic ops on the ref counter, via a block in rw/_config.h: # if _RWSTD_VER_MAJOR 5 # ifdef _RWSTD_OS_LINUX // on Linux/AMD64, unless explicitly requested, disable the use // of atomic operations in string for binary compatibility with // stdcxx 4.1.x # ifndef _RWSTD_USE_STRING_ATOMIC_OPS # define _RWSTD_NO_STRING_ATOMIC_OPS # endif // _RWSTD_USE_STRING_ATOMIC_OPS # endif // _WIN32 # endif // stdcxx 5.0 That is not the cause for the performance difference, though. Even after building with __RWSTD_USE_STRING_ATOMIC_OPS I get the same better performance with the non-cached version. Liviu
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 09/30/12 18:18, Martin Sebor wrote: I see you did a 64-bit build while I did a 32-bit one. so I tried 64-bits. The cached version (i.e., the one compiled with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE). I had made one change to the test program that I thought might account for the difference: I removed the call to abort from the thread function since it was causing the process to exit prematurely in some of my tests. But since you used the modified program for your latest measurements that couldn't be it. I can't explain the differences. They just don't make sense to me. Your results should be the other way around. Can you post the disassembly of function f() for each of the two configurations of the test? The first thing that struck me in the cached `f' was that __string_ref class uses a mutex for synchronizing access to the ref counter. It turns out, for Linux on AMD64 we explicitly use a mutex instead of the atomic ops on the ref counter, via a block in rw/_config.h: # if _RWSTD_VER_MAJOR 5 #ifdef _RWSTD_OS_LINUX // on Linux/AMD64, unless explicitly requested, disable the use // of atomic operations in string for binary compatibility with // stdcxx 4.1.x # ifndef _RWSTD_USE_STRING_ATOMIC_OPS #define _RWSTD_NO_STRING_ATOMIC_OPS # endif // _RWSTD_USE_STRING_ATOMIC_OPS #endif // _WIN32 # endif // stdcxx 5.0 That is not the cause for the performance difference, though. Even after building with __RWSTD_USE_STRING_ATOMIC_OPS I get the same better performance with the non-cached version. Liviu
Fwd: Re: STDCXX-1071 numpunct facet defect
Forwarding with the attachment. Original Message Subject: Re: STDCXX-1071 numpunct facet defect Date: Sun, 30 Sep 2012 12:09:10 -0600 From: Martin Sebor mse...@gmail.com To: Liviu Nicoara nikko...@hates.ms On 09/27/2012 06:36 PM, Liviu Nicoara wrote: On 9/27/12 8:27 PM, Martin Sebor wrote: On 09/27/2012 06:41 AM, Liviu Nicoara wrote: On 09/26/12 20:12, Liviu Nicoara wrote: I have created STDCXX-1071 and linked to STDCXX-1056. [...] I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more. I am attaching here the proposed (4.3.x) patch and the timings results (after re-verifying the correctness of the timing program and the results). The 4.2.x patch, the 4.3.x patch, the test program and the results file are also attached to the incident. The patch isn't binary compatible. We can't remove data members in a minor release. We could only do it in a major release. There are two of them, one for 4.2.x, one for 4.3.x. I'm still curious about the performance, though. It doesn't make sense to me that a call to a virtual function is faster than one to an almost trivial inline function. I scratched my head long on that. I wager that it depends on the machine, too. Here are my timings for library-reduction.cpp when compiled GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small number of trivial changes to get it to compile: With cache No cache real1m38.332s 8m58.568s user6m30.244s34m25.942s sys 0m0.060s 0m3.922s I also experimented with the program on Linux (CEL 4 with 16 CPUs). Initially, I saw no differences between the two versions. So I modified it a bit to make it closer to the library (the modified program is attached). With those changes the timings are below: With cache No cache real0m 1.107s0m 5.669s user0m17.204s0m 5.669s sys 0m 0.000s0m22.347s I also recompiled and re-ran the test on Solaris. To speed things along, I set the number threads and loops to 8 and 100. The numbers are as follows: With cache No cache real0m3.341s 0m26.333s user0m13.052s1m37.470s sys 0m0.009s 0m0.132s The numbers match my expectation. The overhead without the numpunct cache is considerable. Somewhat unexpectedly, the test with the cache didn't crash. I also tried timing the numpunct patch attached to the issue with the test program. The test program runs to completion without the patch but it crashes with a SIGSEGV after the patch is applied. That suggests there's a bug somewhere in do_thousands_sep() (in addition to the caching). Martin Let me know if there is anything I could help with. Liviu #include iostream #include locale #include cstdio #include cstdlib #include cstring #include pthread.h #include unistd.h #define MAX_THREADS 128 static long nloops = 1000, nthreads = 16; static bool volatile pwait = true; namespace X { struct guard { guard (pthread_mutex_t* ptr) : lock_ (ptr) { pthread_mutex_lock (lock_); } ~guard () { pthread_mutex_unlock (lock_); } private: guard (guard const); guard operator= (guard const); private: pthread_mutex_t* lock_; }; struct facet { facet () { pthread_mutex_init (_C_mutex, 0); } void* _C_data () { return _C_impsize ? _C_impdata : _C_get_data (); } void* _C_get_data (); size_t _C_impsize; void* _C_impdata; pthread_mutex_t _C_mutex; }; void* facet::_C_get_data () { if (_C_impsize) return _C_impdata; guard g (_C_mutex); if (_C_impsize) return _C_impdata; #if !defined (NO_USE_STDCXX_LOCALES) _C_impdata = (void*)\3\3; #endif // USE_STDCXX_LOCALES _C_impsize = (size_t)(-1); return _C_impdata; } void* get_data (facet*); struct numpunct : facet { numpunct () : _C_flags () { } virtual std::string do_grouping () const; std::string grouping () const { numpunct* const __self = (numpunct*)this; #if !defined (NO_USE_NUMPUNCT_CACHE) if (!(_C_flags 1)) { __self-_C_flags |= 1; __self-_C_grouping = do_grouping (); } return _C_grouping; #else return do_grouping (); #endif // NO_USE_NUMPUNCT_CACHE } private: int _C_flags; std::string _C_grouping; }; std::string numpunct::do_grouping () const { return (const char*)get_data (const_castnumpunct*(this)); } void* get_data (facet* pfacet) { void* pdata = pfacet-_C_data (); if (pdata) return pdata; // __rw_setlocale loc (...); - other mutex if (pfacet-_C_data ()) return get_data (pfacet); pfacet-_C_impdata = const_castchar*(\4\4); pfacet-_C_impsize = (size_t)(-1); return 0; } } // namespace X
Re: Fwd: Re: STDCXX-1071 numpunct facet defect
On 9/30/12 2:21 PM, Liviu Nicoara wrote: Forwarding with the attachment. Original Message Subject: Re: STDCXX-1071 numpunct facet defect Date: Sun, 30 Sep 2012 12:09:10 -0600 From: Martin Sebor mse...@gmail.com To: Liviu Nicoara nikko...@hates.ms On 9/27/12 8:27 PM, Martin Sebor wrote: Here are my timings for library-reduction.cpp when compiled GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small number of trivial changes to get it to compile: With cache No cache real1m38.332s 8m58.568s user6m30.244s34m25.942s sys 0m0.060s 0m3.922s I also experimented with the program on Linux (CEL 4 with 16 CPUs). Initially, I saw no differences between the two versions. So I modified it a bit to make it closer to the library (the modified program is attached). With those changes the timings I see the difference -- your program has a virtual function it calls from the inline grouping function. are below: With cache No cache real0m 1.107s0m 5.669s user0m17.204s0m 5.669s sys0m 0.000s0m22.347s I also recompiled and re-ran the test on Solaris. To speed things along, I set the number threads and loops to 8 and 100. The numbers are as follows: With cache No cache real0m3.341s 0m26.333s user0m13.052s1m37.470s sys 0m0.009s 0m0.132s The numbers match my expectation. The overhead without the numpunct cache is considerable. I have done another (smaller) round of measurements, this time using the test program you posted. Here are the results: * iMac, 4x Intel, 12S: 16, 1000: Cached Not cached real0m9.300s 0m5.224s user0m36.441s0m20.523s sys 0m0.043s 0m0.068s * iMac, 4x Intel, 12D: CachedNot cached real0m9.012s 0m5.774s user0m35.343s 0m20.997s sys 0m0.045s 0m0.183s * Linux Slackware, 16x AMD Opteron, 12S: 16, 1000: Cached Not cached real0m29.798s 0m3.278s user0m48.662s 0m47.338s sys 6m18.525s 0m3.298s Somewhat unexpectedly, the test with the cache didn't crash. On my iMac it did not crash for me either (gcc 4.5.4), this time. On the other box (gcc 4.5.2) crashed every time with caching, so I had to add a call to fac.grouping outside the thread function to initialize the facet. Liviu
Fwd: Re: Fwd: Re: STDCXX-1071 numpunct facet defect
Forwarding to the list. Duh. Original Message Subject: Re: Fwd: Re: STDCXX-1071 numpunct facet defect Date: Sun, 30 Sep 2012 19:02:27 -0400 From: Liviu Nicoara nikko...@hates.ms To: Martin Sebor mse...@gmail.com On 9/30/12 6:18 PM, Martin Sebor wrote: I see you did a 64-bit build while I did a 32-bit one. so I tried 64-bits. The cached version (i.e., the one compiled with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE). I had made one change to the test program that I thought might account for the difference: I removed the call to abort from the thread function since it was causing the process to exit prematurely in some of my tests. But since you used the modified program for your latest measurements that couldn't be it. I can't explain the differences. They just don't make sense to me. Your results should be the other way around. Can you post the disassembly of function f() for each of the two configurations of the test? Here they are. Liviu Dump of assembler code for function f: 0x00403870 +0: push %r15 0x00403872 +2: push %r14 0x00403874 +4: push %r13 0x00403876 +6: push %r12 0x00403878 +8: push %rbp 0x00403879 +9: push %rbx 0x0040387a +10:mov%rdi,%rbx 0x0040387d +13:sub$0x38,%rsp 0x00403881 +17:nopl 0x0(%rax) 0x00403888 +24:movzbl 0x261f11(%rip),%eax# 0x6657a0 _ZL5pwait 0x0040388f +31:test %al,%al 0x00403891 +33:jne0x403888 f+24 0x00403893 +35:cmpq $0x0,0x261ef5(%rip)# 0x665790 _ZL6nloops 0x0040389b +43:jle0x403b12 f+674 0x004038a1 +49:xor%ebp,%ebp 0x004038a3 +51:xor%r12d,%r12d 0x004038a6 +54:lea0x10(%rsp),%r13 0x004038ab +59:lea0x48(%rbx),%r14 0x004038af +63:lea0x20(%rsp),%r15 0x004038b4 +68:jmpq 0x4039a7 f+311 0x004038b9 +73:nopl 0x0(%rax) 0x004038c0 +80:cmp$0x66a020,%rdx 0x004038c7 +87:mov%rdi,0x20(%rsp) 0x004038cc +92:je 0x403ab8 f+584 0x004038d2 +98:mov%rdx,%rdi 0x004038d5 +101: mov%rdx,(%rsp) 0x004038d9 +105: callq 0x403658 pthread_mutex_lock@plt 0x004038de +110: test %eax,%eax 0x004038e0 +112: mov(%rsp),%rdx 0x004038e4 +116: je 0x4038fb f+139 0x004038e6 +118: mov$0x4452a4,%esi 0x004038eb +123: mov$0xa,%edi 0x004038f0 +128: xor%eax,%eax 0x004038f2 +130: callq 0x404370 _ZN4__rw10__rw_throwEiz 0x004038f7 +135: mov(%rsp),%rdx 0x004038fb +139: addl $0x1,0x28(%rdx) 0x004038ff +143: test %rdx,%rdx 0x00403902 +146: je 0x40390c f+156 0x00403904 +148: mov%rdx,%rdi 0x00403907 +151: callq 0x4036c8 pthread_mutex_unlock@plt 0x0040390c +156: mov0x20(%rsp),%rdx 0x00403911 +161: mov%rdx,%rdi 0x00403914 +164: mov%rdx,(%rsp) 0x00403918 +168: callq 0x403258 strlen@plt 0x0040391d +173: mov(%rsp),%rdx 0x00403921 +177: add%rax,%r12 0x00403924 +180: lea-0x40(%rdx),%rcx 0x00403928 +184: cmp$0x66a020,%rcx 0x0040392f +191: je 0x40398b f+283 0x00403931 +193: mov%rcx,%rdi 0x00403934 +196: mov%rcx,0x8(%rsp) 0x00403939 +201: callq 0x403658 pthread_mutex_lock@plt 0x0040393e +206: test %eax,%eax 0x00403940 +208: mov(%rsp),%rdx 0x00403944 +212: mov0x8(%rsp),%rcx 0x00403949 +217: je 0x403965 f+245 0x0040394b +219: mov$0x4452a4,%esi 0x00403950 +224: mov$0xa,%edi 0x00403955 +229: xor%eax,%eax 0x00403957 +231: callq 0x404370 _ZN4__rw10__rw_throwEiz 0x0040395c +236: mov0x8(%rsp),%rcx 0x00403961 +241: mov(%rsp),%rdx 0x00403965 +245: mov-0x18(%rdx),%esi 0x00403968 +248: test %rcx,%rcx 0x0040396b +251: lea-0x1(%rsi),%eax 0x0040396e +254: mov%eax,-0x18(%rdx) 0x00403971 +257: je 0x403983 f+275 0x00403973 +259: mov%rcx,%rdi 0x00403976 +262: mov%esi,0x8(%rsp) 0x0040397a +266: callq 0x4036c8 pthread_mutex_unlock@plt 0x0040397f +271: mov0x8(%rsp),%esi 0x00403983 +275: test %esi,%esi 0x00403985 +277: jle0x403a80 f+528 0x0040398b +283: add$0x1,%ebp 0x0040398e +286: movq $0x0,0x20(%rsp
[PATCH] Re: STDCXX-1072 SPARC V8 mutex alignment requirements
On 9/28/12 11:32 AM, Travis Vitek wrote: -Original Message- From: Liviu Nicoara Sent: Friday, September 28, 2012 5:29 AM [...] The patch assumes the type `long double' exists on every platform. While I do believe that it is available everywhere, we have lots of conditional code guarding its use in the library now. If we are going to use `long double' in this context, we should guard it with _RWSTD_NO_LONG_DOUBLE. I think an even cleaner solution is to switch to using __rw_aligned_buffer instead. It gives us a single point of failure for alignment issues like this, and it makes the code self-documenting and easier to read. I am attaching another patch here, which makes use of the __rw_aligned_buffer, slightly more verbose but the code is slightly cleaner. Thanks! Liviu Index: src/messages.cpp === --- src/messages.cpp(revision 1390953) +++ src/messages.cpp(working copy) @@ -38,6 +38,7 @@ #include rw/_mutex.h +#include podarray.h #if !defined (_RWSTD_NO_NL_TYPES_H) !defined (_RWSTD_NO_CATOPEN_CATGETS) # include nl_types.h @@ -64,10 +65,7 @@ _RWSTD_NAMESPACE (__rw) { struct __rw_open_cat_data { nl_catd catd; -union { -void *_C_align; -char _C_data [sizeof (_STD::locale)]; -} loc; +__rw_aligned_buffer_STD::locale loc; }; struct __rw_open_cat_page @@ -159,7 +157,8 @@ __rw_manage_cat_data (int cat, __rw_op cat = int (n_catalogs); catalogs [cat]-catd = pcat_data-catd; -memcpy (catalogs [cat]-loc, pcat_data-loc, +memcpy (catalogs [cat]-loc._C_store (), +pcat_data-loc._C_store (), sizeof (_STD::locale)); if (size_t (cat) largest_cat) @@ -175,7 +174,8 @@ __rw_manage_cat_data (int cat, __rw_op } catalogs [cat]-catd = pcat_data-catd; -memcpy (catalogs [cat]-loc, pcat_data-loc, +memcpy (catalogs [cat]-loc._C_store (), +pcat_data-loc._C_store (), sizeof (_STD::locale)); if (size_t (cat) largest_cat) @@ -258,8 +258,9 @@ int __rw_cat_open (const _STD::string c return -1; __rw_open_cat_data cat_data; + cat_data.catd = catd; -new (cat_data.loc) _STD::locale (loc); +new (cat_data.loc._C_store ()) _STD::locale (loc); int cat = -1; __rw_manage_cat_data (cat, cat_data); @@ -307,7 +308,7 @@ const _STD::locale __rw_get_locale (int _RWSTD_ASSERT (0 != pcat_data); -return *(_RWSTD_REINTERPRET_CAST (_STD::locale*, (pcat_data-loc))); +return *pcat_data-loc._C_data (); } @@ -329,8 +330,7 @@ void __rw_cat_close (int cat) catclose (pcat_data-catd); -_STD::locale* const ploc = -_RWSTD_REINTERPRET_CAST (_STD::locale*, pcat_data-loc); +_STD::locale* const ploc = pcat_data-loc._C_data (); ploc-~locale (); Index: src/locale_body.cpp === --- src/locale_body.cpp (revision 1390953) +++ src/locale_body.cpp (working copy) @@ -1024,14 +1024,12 @@ _C_manage (__rw_locale *plocale, const c // the classic C locale is statically allocated // and not destroyed during the lifetime of the process -static union { -void* _C_align; -char _C_buf [sizeof (__rw_locale)]; -} classic_body; +static __rw_aligned_buffer__rw_locale classic_body; // construct a locale body in place // with the initial reference count of 1 -classic = new (classic_body) __rw_locale (locname); +classic = new (classic_body._C_store ()) +__rw_locale (locname); _RWSTD_ASSERT (1 == classic-_C_ref); } Index: src/use_facet.h === --- src/use_facet.h (revision 1390953) +++ src/use_facet.h (working copy) @@ -37,6 +37,7 @@ #include rw/_defs.h #include access.h +#include podarray.h // helper macro _RWSTD_DEFINE_FACET_FACTORY() defines a facet factory @@ -63,12 +64,9 @@ } \ else { \ /* construct an ordinary facet in static memory */ \ - static union { \ - void *align_; \ - char data_ [sizeof (__rw_ ## fid ## _facet)]; \ - } f
STDCXX-970 and locale tests
While checking for fallout from 1072 I stumbled upon the collate test and its numerous (unrelated to 1072) failures. While I am looking at it I have a question: it seems to me that the locale tests do not test against STDCXX locale database. Is that right? So, all locale tests are supposed to be run without RWSTD_LOCALE_ROOT defined? Thanks, Liviu
Re: STDCXX-1071 numpunct facet defect
I thought I replied but I see no trace of my post: On 09/27/12 20:27, Martin Sebor wrote: On 09/27/2012 06:41 AM, Liviu Nicoara wrote: On 09/26/12 20:12, Liviu Nicoara wrote: I have created STDCXX-1071 and linked to STDCXX-1056. [...] I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more. I am attaching here the proposed (4.3.x) patch and the timings results (after re-verifying the correctness of the timing program and the results). The 4.2.x patch, the 4.3.x patch, the test program and the results file are also attached to the incident. The patch isn't binary compatible. We can't remove data members in a minor release. We could only do it in a major release. I posted two patches, one for 4.3.x and one for 4.2.x (the latter is binary compatible). My understanding was that minor version revisions may break binary compatibility. Liviu
STDCXX-1072 SPARC V8 mutex alignment requirements
I have created the above and linked it to the closed STDCXX-1066. In short, my reading about this issue is that the kernel patch changed the alignment of the userland mutex objects from a machine word to a double-word boundary. No changes are required of the users who use such objects in their programs unless users create mutex objects in buffers which may not be aligned on a proper boundary. E.g., the following are safe: mutex_t lock; struct S { char misalign; mutex_t lock; }; whereas the following is not: union { void* align; char buf [sizeof mutex_t]; } u; new (u) mutex_t; because the alignment requirements for void pointer are less strict than for mutex_t. A few places in the library use the latter for all sorts of static objects (mostly local statics). I looked esp. for places where we build objects that contain mutex sub-objects inside a union-aligned buffer: struct S { char c; mutex_t m; }; ... union { void* align; // - incorrect char buf [sizeof (S)]; } u; new (u) S (); The alignment must be changed to a value equal or greater than the mutex alignment requirements. IMO, the patch I attached does not break binary compatibility. It uses a one size fits all long double for alignment -- like we use in rw/_mutex.h -- but in doing so dispenses with all sorts of preprocessor conditionals. I still don't have access to a SPARC machine. Any feed-back and/or SPARC build results are more than welcome! Thanks! Liviu Index: src/messages.cpp === --- src/messages.cpp(revision 1390953) +++ src/messages.cpp(working copy) @@ -65,7 +65,7 @@ struct __rw_open_cat_data { nl_catd catd; union { -void *_C_align; +long double _C_align; char _C_data [sizeof (_STD::locale)]; } loc; }; Index: src/locale_body.cpp === --- src/locale_body.cpp (revision 1390953) +++ src/locale_body.cpp (working copy) @@ -1025,7 +1025,7 @@ _C_manage (__rw_locale *plocale, const c // the classic C locale is statically allocated // and not destroyed during the lifetime of the process static union { -void* _C_align; +long double _C_align; char _C_buf [sizeof (__rw_locale)]; } classic_body; Index: src/use_facet.h === --- src/use_facet.h (revision 1390953) +++ src/use_facet.h (working copy) @@ -64,7 +64,7 @@ else { \ /* construct an ordinary facet in static memory */ \ static union { \ - void *align_; \ + long double align_; \ char data_ [sizeof (__rw_ ## fid ## _facet)]; \ } f;\ static __rw_facet* const pf = \ @@ -91,7 +91,7 @@ { \ /* construct an ordinary facet in static memory */ \ static union { \ - void *align_; \ + long double align_; \ char data_ [sizeof (__rw_ ## fid ## _facet)]; \ } f;\ static __rw_facet* const pf = \ Index: src/ctype.cpp === --- src/ctype.cpp (revision 1390953) +++ src/ctype.cpp (working copy) @@ -627,7 +627,7 @@ _RWSTD_EXPORT __rw_facet* __rw_ct_ctype } else { static union { -void *align_; +long double align_; char data_ [sizeof (_STD::ctypechar)]; } f; static __rw_facet* const pf = Index: src/locale_classic.cpp === --- src/locale_classic.cpp (revision 1390953) +++ src/locale_classic.cpp (working copy) @@ -39,7 +39,7 @@ _RWSTD_NAMESPACE (__rw) { // static buffer for the classic C locale object static union { -void* _C_align; +long double _C_align; char _C_buf [sizeof (_STD::locale)]; } __rw_classic;
Re: STDCXX-1072 SPARC V8 mutex alignment requirements
On 09/28/12 08:29, Liviu Nicoara wrote: I have created the above and linked it to the closed STDCXX-1066. [...] IMO, the patch I attached does not break binary compatibility. Scratch this, I haven't thought it through. Thanks, Liviu
Re: STDCXX-1072 SPARC V8 mutex alignment requirements
On 09/28/12 08:45, Liviu Nicoara wrote: On 09/28/12 08:29, Liviu Nicoara wrote: I have created the above and linked it to the closed STDCXX-1066. [...] IMO, the patch I attached does not break binary compatibility. Scratch this, I haven't thought it through. Actually, after more thought, I believe that the patch itself is not causing binary incompatibility. The library built on a kernel-patched system will be binary incompatible with a library built on a previous system, regardless of the patch I presented earlier, because new mutex alignment requirements will have changed the size and layout of library objects. Comments are welcome. Thanks, Liviu
Re: STDCXX-1071 numpunct facet defect
On 09/28/12 11:01, Travis Vitek wrote: Only major versions can break binary. The versioning policy for stdcxx can be found here.. http://stdcxx.apache.org/versions.html Thanks, that clarifies things. Liviu
Re: STDCXX-1072 SPARC V8 mutex alignment requirements
On 09/28/12 11:45, Travis Vitek wrote: Liviu, Sorry I didn't look until just now, but it appears that I could have re-opened STDCXX-1066. I only see the 'Reopen Issue' button for STDCXX issues, but it is most definitely there. Perhaps there is some sort of permission issue for you? It's ok, I think it's somewhat cleaner to create a new one and link it to the old one. Even if clean was not a concern, it was within Stefan's options to close the incident. I don't know. Also, STDCXX-1066 appears to have been a duplicate of STDCXX-1040. Yep, forgot about it, I am thinking about linking that one, too. Thanks.
Re: STDCXX-1072 SPARC V8 mutex alignment requirements
On 09/28/12 11:32, Travis Vitek wrote: -Original Message- From: Liviu Nicoara Sent: Friday, September 28, 2012 5:29 AM In short, my reading about this issue is that the kernel patch changed the alignment of the userland mutex objects from a machine word to a double-word boundary. No changes are required of the users who use such objects in their programs unless users create mutex objects in buffers which may not be aligned on a proper boundary. Your reading of this is consistent with my previous understanding of the problem, so that is good. I still don't have access to a SPARC machine. Any feed-back and/or SPARC build results are more than welcome! I can provide build results for SPARCV9 if we want them, but I thought that the problem only came up on 32-bit SPARCV8 builds. I guess a -xarch=sparcv8 build will do. It is quite unlikely anybody has a real 32-bit SPARC hardware. The patch assumes the type `long double' exists on every platform. While I do believe that it is available everywhere, we have lots of conditional code guarding its use in the library now. If we are going to use `long double' in this context, we should guard it with _RWSTD_NO_LONG_DOUBLE. I think an even cleaner solution is to switch to using __rw_aligned_buffer instead. It gives us a single point of failure for alignment issues like this, and it makes the code self-documenting and easier to read. I took a cue from an alignment in the rw/_mutex.h code there. I'll do better. As for your concerns about binary compatibility, I think that everything should be okay. We aren't changing the size of anything that is being passed around, we're just changing its alignment. I could be wrong, but I'm pretty sure that we're safe. The library object sizes and layouts will change with or without our patch. Before the kernel patch our objects will have one layout and size and different ones afterwards. E.g.: struct locale { int c; pthread_mutex_t lock; }; before kernel patching you'd have no padding between the members. That's what I thought when firing that second post about compatibility. Liviu
Re: STDCXX-1072 SPARC V8 mutex alignment requirements
On 09/28/12 13:51, Martin Sebor wrote: [...] One other comment: I would suggest choosing subjects for bug reports that reflect the problem rather than a fix for it or a rationale for it. For STDCXX-1066 I think something like Library mutex objects misaligned on SPARCV8 would better capture the problem than the current title. (It's also up to us to rename an issue if we find it more descriptive than the original.) I can't do that myself. I looked at that and 1056 and there is no button for me to reopen, or to edit stuff which is not mine. Needless to say, I like my issues better. But I have no problems with the re-opening of the closed ones. Thanks, Liviu
Re: STDCXX-1071 numpunct facet defect
On 9/28/12 11:01 AM, Travis Vitek wrote: Only major versions can break binary. The versioning policy for stdcxx can be found here.. http://stdcxx.apache.org/versions.html I have renamed the binary-incompatible patch as patch-5.0.x.diff. Thanks, Liviu Travis -Original Message- From: Liviu Nicoara [mailto:nikko...@hates.ms] Sent: Friday, September 28, 2012 3:52 AM To: dev@stdcxx.apache.org Subject: Re: STDCXX-1071 numpunct facet defect I thought I replied but I see no trace of my post: On 09/27/12 20:27, Martin Sebor wrote: On 09/27/2012 06:41 AM, Liviu Nicoara wrote: On 09/26/12 20:12, Liviu Nicoara wrote: I have created STDCXX-1071 and linked to STDCXX-1056. [...] I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more. I am attaching here the proposed (4.3.x) patch and the timings results (after re-verifying the correctness of the timing program and the results). The 4.2.x patch, the 4.3.x patch, the test program and the results file are also attached to the incident. The patch isn't binary compatible. We can't remove data members in a minor release. We could only do it in a major release. I posted two patches, one for 4.3.x and one for 4.2.x (the latter is binary compatible). My understanding was that minor version revisions may break binary compatibility. Liviu
Re: [jira] [Closed] (STDCXX-1066) SPARCV8 requires pthread_mutex_t and pthread_cond_t to be aligned on an 8-byte boundary
On 09/27/12 07:15, Pavel Heimlich, a.k.a. hajma wrote: 2012/9/26 Liviu Nicoara nikko...@hates.ms: On 09/26/12 05:49, Pavel Heimlich, a.k.a. hajma wrote: 2012/9/26 Liviu Nicoara nikko...@hates.ms: On 9/25/12 7:56 PM, Stefan Teleman (JIRA) wrote: [ https://issues.apache.org/jira/browse/STDCXX-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anybody around here, except Stefan, who has access to a SPARC V8 machine updated to the specified kernel update or later, and who is willing to run a simple test program? It's a 5 minute job at most. Please point me to the test program. Hi Pavel, I attached it. IIRC Solaris had both Solaris threads API and a POSIX threads API on top of it. Could you please give it a run with MUTEX defined to both mutex_t and pthread_mutex_t? Might need to tweak the includes. Here it is, the define seems to make no difference. BTW I'm not sure about the bug description, was Solaris 10 ever supported on a sparcv8? I ran it on the crappiest machine I could find, but all that is sparcv9. Sun Blade 1000 - UltraSPARC-III+(sparcv9), running Solaris 10u9 bash-3.00# /opt/SUNWspro/bin/CC a.cpp bash-3.00# ./a.out 24 ffbffcd8 ffbffcf0 ffbffd08 ffbffd20 ffbffd38 ffbffd50 ffbffd68 ffbffd80 ffbffbe0 ffbffc00 ffbffc20 ffbffc40 ffbffc60 ffbffc80 ffbffca0 ffbffcc0 bash-3.00# /opt/SUNWspro/bin/CC -xarch=v8 ./a.cpp bash-3.00# ./a.out 24 ffbffcd8 ffbffcf0 ffbffd08 ffbffd20 ffbffd38 ffbffd50 ffbffd68 ffbffd80 ffbffbe0 ffbffc00 ffbffc20 ffbffc40 ffbffc60 ffbffc80 ffbffca0 ffbffcc0 bash-3.00# vi a.cpp (--- pthread_mutex_t) bash-3.00# /opt/SUNWspro/bin/CC a.cpp bash-3.00# ./a.out 24 ffbffcd8 ffbffcf0 ffbffd08 ffbffd20 ffbffd38 ffbffd50 ffbffd68 ffbffd80 ffbffbe0 ffbffc00 ffbffc20 ffbffc40 ffbffc60 ffbffc80 ffbffca0 ffbffcc0 bash-3.00# /opt/SUNWspro/bin/CC -xarch=v8 ./a.cpp bash-3.00# ./a.out 24 ffbffcd8 ffbffcf0 ffbffd08 ffbffd20 ffbffd38 ffbffd50 ffbffd68 ffbffd80 ffbffbe0 ffbffc00 ffbffc20 ffbffc40 ffbffc60 ffbffc80 ffbffca0 ffbffcc0 HTH Very much appreciated. Liviu
Re: [PATCH] STDCXX-1069 [was: Re: SUNPro 5.12 compilation failure in optimized Linux builds]
On 09/23/12 12:15, Liviu Nicoara wrote: On 09/22/12 00:51, Liviu Nicoara wrote: Optimized (but not debug) builds fail to compile setlocale.cpp with the error: A patch and a comment have been attached to the issue. I am posting it here to save a trip to the JIRA issue. Any feed-back is appreciated. The issue can be tracked down to the interaction between the definition of _XOPEN_SOURCE in src/setlocale.cpp and util/path.cpp, the system headers, and Solaris Studio C . It seems that the original purpose for the definitions has been lost because both S_IFDIR and symlink are defined and declared, respectively, on modern Linux (edit: by default). Moreover, it seems that the original definition of _XOPEN_SOURCE was incorrect, without a value. I propose we remove the two blocks – see the patch. Thanks, Liviu Index: src/setlocale.cpp === --- src/setlocale.cpp (revision 1388733) +++ src/setlocale.cpp (working copy) @@ -33,11 +33,6 @@ #include rw/_defs.h -#if defined (__linux__) !defined (_XOPEN_SOURCE) - // need S_IFDIR on Linux -# define _XOPEN_SOURCE -#endif // __linux__ !_XOPEN_SOURCE - #include locale.h // for setlocale() #include stdlib.h // for getenv() #include string.h // for memcpy(), strcmp() Index: util/path.cpp === --- util/path.cpp (revision 1388733) +++ util/path.cpp (working copy) @@ -27,16 +27,6 @@ **/ #ifndef _WIN32 -# ifdef __linux__ - // for symlink() -#ifndef _XOPEN_SOURCE -# define _XOPEN_SOURCE -#endif -#ifndef _XOPEN_SOURCE_EXTENDED -# define _XOPEN_SOURCE_EXTENDED -#endif -# endif // __linux__ - # include unistd.h // for getcwd() # include sys/stat.h// for struct stat, stat() #else
Re: STDCXX-1068 and alignment
On 09/22/12 16:22, Stefan Teleman wrote: On Sat, Sep 22, 2012 at 4:07 PM, Liviu Nicoara nikko...@hates.ms wrote: Stefan, could you please elaborate on the misaligned reads/writes that you observed on those platforms? What was failing? Several tests from the test harness were failing because of it, and for no other reason. I'll have to go through my stuff at work and look at the test reports from the Compiler Guys, we were testing these back in March 2011, and right now I can't remember exactly which tests were failing. I'll have the details in a few days. Any news on this one? Thanks, Liviu
Re: STDCXX-1071 numpunct facet defect
On 09/26/12 20:12, Liviu Nicoara wrote: I have created STDCXX-1071 and linked to STDCXX-1056. [...] I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more. I am attaching here the proposed (4.3.x) patch and the timings results (after re-verifying the correctness of the timing program and the results). The 4.2.x patch, the 4.3.x patch, the test program and the results file are also attached to the incident. Thanks, Liviu Index: include/loc/_numpunct.h === --- include/loc/_numpunct.h (revision 1388733) +++ include/loc/_numpunct.h (working copy) @@ -61,7 +61,7 @@ struct numpunct: _RW::__rw_facet string_type; _EXPLICIT numpunct (_RWSTD_SIZE_T __ref = 0) -: _RW::__rw_facet (__ref), _C_flags (0) { } +: _RW::__rw_facet (__ref) { } virtual ~numpunct () _RWSTD_ATTRIBUTE_NOTHROW; @@ -109,15 +109,6 @@ protected: virtual string_type do_falsename () const { return _RW::__rw_get_punct (this, _RW::__rw_fn, char_type ()); } - -private: - -int _C_flags; // bitmap of cached data valid flags -string _C_grouping;// cached results of virtual members -string_type _C_truename; -string_type _C_falsename; -char_type _C_decimal_point; -char_type _C_thousands_sep; }; @@ -139,17 +130,7 @@ template class _CharT inline _TYPENAME numpunct_CharT::char_type numpunct_CharT::decimal_point () const { -if (!(_C_flags _RW::__rw_dp)) { - -numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this); - -// [try to] get the decimal point first (may throw) -// then set a flag to avoid future initializations -__self-_C_decimal_point = do_decimal_point (); -__self-_C_flags |= _RW::__rw_dp; -} - -return _C_decimal_point; +return do_decimal_point (); } @@ -157,34 +138,14 @@ template class _CharT inline _TYPENAME numpunct_CharT::char_type numpunct_CharT::thousands_sep () const { -if (!(_C_flags _RW::__rw_ts)) { - -numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this); - -// [try to] get the thousands_sep first (may throw) -// then set a flag to avoid future initializations -__self-_C_thousands_sep = do_thousands_sep (); -__self-_C_flags |= _RW::__rw_ts; -} - -return _C_thousands_sep; +return do_thousands_sep (); } template class _CharT inline string numpunct_CharT::grouping () const { -if (!(_C_flags _RW::__rw_gr)) { - -numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this); - -// [try to] get the grouping first (may throw) -// then set a flag to avoid future initializations -__self-_C_grouping = do_grouping (); -__self-_C_flags|= _RW::__rw_gr; -} - -return _C_grouping; +return do_grouping (); } @@ -192,17 +153,7 @@ template class _CharT inline _TYPENAME numpunct_CharT::string_type numpunct_CharT::truename () const { -if (!(_C_flags _RW::__rw_tn)) { - -numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this); - -// [try to] get the true name first (may throw) -// then set a flag to avoid future initializations -__self-_C_truename = do_truename (); -__self-_C_flags|= _RW::__rw_tn; -} - -return _C_truename; +return do_truename (); } @@ -210,17 +161,7 @@ template class _CharT inline _TYPENAME numpunct_CharT::string_type numpunct_CharT::falsename () const { -if (!(_C_flags _RW::__rw_fn)) { - -numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this); - -// [try to] get the false name first (may throw) -// then set a flag to avoid future initializations -__self-_C_falsename = do_falsename (); -__self-_C_flags |= _RW::__rw_fn; -} - -return _C_falsename; +return do_falsename (); } // #endif _RWSTD_NO_EXT_NUMPUNCT_PRIMARY -*- mode: org -*- * Machines: ** iMac, Intel, 4 cores: $ uname -a; gcc -v Darwin imax 11.4.0 Darwin Kernel Version 11.4.0: Mon Apr 9 19:32:15 PDT 2012; root:xnu-1699.26.8~1/RELEASE_X86_64 x86_64 gcc version 4.7.1 (GCC) ** Linux Slackware, AMD, 16 cores: $ uname -a; gcc -v Linux behemoth 2.6.37.6 #3 SMP Sat Apr 9 22:49:32 CDT 2011 x86_64 AMD Opteron(tm) Processor 6134 AuthenticAMD GNU/Linux gcc version 4.5.2 (GCC) * Method ** Library Apply the patch. Build an optimized library (I used 12S in all runs). Build the library, rwtest, and locale database: $ nice make -Clib $ nice make -Cbin locales $ nice make -Crwtest Properly export the necessary envar if running against STDCXX locale database or unset, otherwise: $ export RWSTD_LOCALE_ROOT=/path/to/.../nls ** Test program Place the multi-threaded program source file, t.cpp, in srcdir/tests/localization and run make
Re: svn commit: r1389337 - /stdcxx/branches/4.2.x/tests/support/atomic_xchg.cpp
On 09/25/12 23:20, Martin Sebor wrote: On 09/24/2012 06:38 AM, lnico...@apache.org wrote: Author: lnicoara Date: Mon Sep 24 12:38:17 2012 New Revision: 1389337 URL: http://svn.apache.org/viewvc?rev=1389337view=rev Log: 2012-09-24 Liviu Nicoara nikko...@hates.ms * tests/support/atomic_xchg.cpp: (run_test) moved counter volatile qualification to match STDCXX conventions (see r1388733) The stdcxx ChangeLog format follows the GNU Coding Standard: http://www.gnu.org/prep/standards/html_node/Change-Logs.html I.e., this entry should look like this: * tests/support/atomic_xchg.cpp (run_test): Move counter volatile qualification to match STDCXX convention (see r1388733). (The colon follows the parenthesized function name. The text should also end with a period.) The svn propedit command lets you change a comment for an already committed change. Sorry about that. I fixed it. L
Reopen closed incidents
I want to reopen the closed incidents, esp. 1056. On a second thought it might be more useful to open new ones and link the old ones so that we don't mess with ownership, etc. If nobody objects to this I will go forward with the latter. Thanks.
STDCXX-1071 numpunct facet defect
I have created STDCXX-1071 and linked to STDCXX-1056. I have reduced the scope to numpunct because moneypunct is not failing for me. If someone has a moneypunct failure listing I want to see it. I have reduced the library code to a failing test case. I have attached there the reduced program. The program shows the same failures like the library (interestingly enough it does not fail when removing the data caching). I have (re)attached the forwarding patches I created for 1056, one binary compatible, the other not. Martin, my wish is to move this to some completion. Please let me know if there is something else you think I should do to speed this up. I am open to all questions, the more the better. Most of my opinions have been expressed earlier, but please ask if you want to know more. Thanks, Liviu
Re: STDCXX-1056 : numpunct fix
On 09/24/12 22:28, Martin Sebor wrote: On 09/20/2012 06:46 PM, Stefan Teleman wrote: On Thu, Sep 20, 2012 at 8:39 PM, Liviu Nicoaranikko...@hates.ms wrote: I have not created this requirement out of thin air. STDCXX development has functioned in this manner for as long as I remember. If it does not suit you, that's fine. That would explain why these bugs are present in the first place. If the official method of determining thread-safety here is fprintf(3C), then we have a much bigger problem than 22.locale.numpunct.mt. Tests that end in .mt.cpp use the RW_ASSERT() macro to verify correctness in multithreaded programs. There is no way at the moment to run a thread analyzer as part of the test suite, although it would be a nice addition. As I said in my other reply, though, not all data races necessarily indicate bugs, so we'd need a way to distinguish between know/benign races and the rest. (Unless we decide to eliminate all races and accept the performance penalty.) Preserving the lazy initialization is possible: we perfectly forward to the implementation with either: 1. no additional synchronization (preserves fast reads of the data), while fully knowing that there are data races in __rw_get_numpunct/_C_get_data there, or 2.we put a big lock on top of all and every facet operations (the version that would be race free). Caching seems out of the window for now, more so in the presence of unguarded reads of facet's data. The only argument for (1) is that at the moment we don't have a failing test yet, and not for lack of arduous trying. However, if atomic/membar APIs would be recognized by the compiler, lazy initialization of the facet and caching of data would be attainable in a quite simple fashion. Liviu
Re: STDCXX-1056 : numpunct fix
On 09/24/12 23:50, Stefan Teleman wrote: On Mon, Sep 24, 2012 at 10:03 PM, Martin Sebor mse...@gmail.com wrote: FWIW, there are race conditions in stdcxx. Some of them are by design and benign on the systems the library runs on (although I acknowledge that some others may be bugs). One such benign date race is: timeT1 T2 0x = N 1x = N read x where x is a scalar that can be accessed atomically by the CPU and the compiler. I think some of the lazy facet initialization falls under this class. It would be nice to fix them but only if it doesn't slow things down. The others need to be fixed in any case. The race conditions I am talking about are not benign. Caching failures aside, we have no failing tests. Before saying they are malign you need to objectively show a failing program. It shows the types of race conditions it's reporting: these are read/write race conditions, not read/read. The thread analyzer's html filter doesn't show the types of races reported as clearly as the command-line analyzer which has a windowing GUI. Martin's example above is a read-write race. In the absence of a failing test case it is quite unreasonable to modify the current implementation. Liviu
STDCXX-1070
I filed 1070, failure to build 22.locale.collate.cpp on Linux with gcc 4.7.1. Gcc, Comeau and Clang fail to compile it, Intel and Sun are fine. It looks to me like Intel and Sun compilers are not doing the right thing. A small test case and a patch have been attached. The failing code has been reduced to: $ cat test.cpp; g++ -c test.cpp template class charT void f () { g (charT ('a')); } template class charT void g (charT) { } int h () { return f char (), 0; } test.cpp: In instantiation of 'void f() \[with charT = char\]': test.cpp:14:23: required from here test.cpp:4:5: error: 'g' was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation \[-fpermissive\] test.cpp:8:6: note: 'templateclass charT void g(charT)' declared here, later in the translation unit Liviu
Re: [jira] [Closed] (STDCXX-1056) std::moneypunct and std::numpunct implementations are not thread-safe
On 9/25/12 7:56 PM, Stefan Teleman (JIRA) wrote: [ https://issues.apache.org/jira/browse/STDCXX-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan, I don't think it's ok to close this bug. The race conditions are there and we have not come to a completely satisfactory conclusion on how to deal with them, or even if we should deal with them. Whichever it is we gotta keep this discussion open. I sure hope you want to be a part of it. FWIW, I have spent quite some time looking at your proposed patch and I am going to reopen the incident. If I can. Liviu Stefan Teleman closed STDCXX-1056. -- Resolution: Won't Fix Bug will not be fixed. Upstream refuses to acknowledge the existence of the bug in spite of objective evidence to the contrary. std::moneypunct and std::numpunct implementations are not thread-safe - Key: STDCXX-1056 URL: https://issues.apache.org/jira/browse/STDCXX-1056 Project: C++ Standard Library Issue Type: Bug Components: 22. Localization Affects Versions: 4.2.1, 4.2.x, 4.3.x, 5.0.0 Environment: Solaris 10 and 11, RedHat and OpenSuSE Linux, Sun C++ Compilers 12.1, 12.2, 12.3 Issue is independent of platform and/or compiler. Reporter: Stefan Teleman Labels: thread-safety Fix For: 4.2.x, 4.3.x, 5.0.0 Attachments: 22.locale.numpunct.mt.out, facet.cpp.diff, locale_body.cpp.diff, punct.cpp.diff, runtests-linux32-all.out, runtests-linux64-all.out, runtests.out, STDCXX-1056-additional-timings.tgz, stdcxx-1056.patch, stdcxx-1056-timings.tgz, stdcxx-4.2.x-numpunct-perfect-forwarding.patch, stdcxx-4.3.x-numpunct-perfect-forwarding.patch several member functions in std::moneypunct and std::numpunct return a std::string by value (as required by the Standard). The implication of return-by-value being that the caller owns the returned object. In the stdcxx implementation, the std::basic_string copy constructor uses a shared underlying buffer implementation. This shared buffer creates the first problem for these classes: although the std::string object returned by value *appears* to be owned by the caller, it is, in fact, not. In a mult-threaded environment, this underlying shared buffer can be subsequently modified by a different thread than the one who made the initial call. Furthermore, two or more different threads can access the same shared buffer at the same time, and modify it, resulting in undefined run-time behavior. The cure for this defect has two parts: 1. the member functions in question must truly return a copy by avoiding a call to the copy constructor, and using a constructor which creates a deep copy of the std::string. 2. access to these member functions must be serialized, in order to guarantee atomicity of the creation of the std::string being returned by value. Patch for 4.2.1 to follow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: [jira] [Closed] (STDCXX-1066) SPARCV8 requires pthread_mutex_t and pthread_cond_t to be aligned on an 8-byte boundary
On 9/25/12 7:56 PM, Stefan Teleman (JIRA) wrote: [ https://issues.apache.org/jira/browse/STDCXX-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anybody around here, except Stefan, who has access to a SPARC V8 machine updated to the specified kernel update or later, and who is willing to run a simple test program? It's a 5 minute job at most. Thanks! Stefan Teleman closed STDCXX-1066. -- Resolution: Won't Fix Bug will not be fixed upstream. It is fixed in the Solaris releases. SPARCV8 requires pthread_mutex_t and pthread_cond_t to be aligned on an 8-byte boundary --- Key: STDCXX-1066 URL: https://issues.apache.org/jira/browse/STDCXX-1066 Project: C++ Standard Library Issue Type: Bug Components: Thread Safety Affects Versions: 4.2.1, 4.2.2, 4.2.x, 4.3.x Environment: Solaris 10 Update 6 or later on SPARCV8 [32-bit] Defect is compiler-independent - in reality it only affects Sun Studio and GCC. Reporter: Stefan Teleman Labels: features, runtime, threads Fix For: 4.2.2, 4.2.x, 4.3.x Attachments: _config-gcc.h.stdcxx-1066.patch, _config-sunpro.h.stdcxx-1066.patch, ctype.cpp.stdcxx-1066.patch, exception.cpp.stdcxx-1066.patch, ios.cpp.stdcxx-1066.patch, iostream.cpp.stdcxx-1066.patch, iostream.stdcxx-1066.patch, locale_body.cpp.stdcxx-1066.patch, locale_classic.cpp.stdcxx-1066.patch, messages.cpp.stdcxx-1066.patch, _mutex.h.stdcxx-1066.patch, time_put.cpp.stdcxx-1066.patch, use_facet.h.stdcxx-1066.patch Starting with Solaris 10 Update 6, on SPARCV8, pthread_mutex_t and pthread_cond_t MUST be aligned on an 8-byte boundary. Misaligned access will result in either SEGV or SIGBUS. There are numerous places in the multi-threaded version of stdcxx where pthread_mutex_t and/or pthread_cond_t types are contained within an union, but with an enforced alignment different than 8. All these instances must be corrected, and #ifdef-guarded for SPARCV8. Patches to follow shortly, this is just opening the issue. Warning: the patchset resolving this issue is very large, and it affects a large number of files. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: STDCXX-1056 : numpunct fix
On 9/24/12 11:06 PM, Stefan Teleman wrote: On Mon, Sep 24, 2012 at 7:48 PM, Liviu Nicoara nikko...@hates.ms wrote: Stefan, was it your intention to completely eliminate all the race conditions with this last patch? Is this what the tools showed in your environment? https://issues.apache.org/jira/browse/STDCXX-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13461452#comment-13461452 Yes, all the race conditions in std::numpunct and std::moneypunct. Not all the race conditions in stdcxx. FWIW, I have analyzed the performance of the latest patch from Stefan. The test program is the same test program used in the timings attachments to the incident. The version of the library was 4.2.1, compiler gcc 4.7.1, with: 1. The simple, perfect forwarding patch, with no caching, which does eliminate race conditions. 2. Stefan's latest patch 3. A patch I created which truly, if crudely, eliminates all facet race conditions by using a top-level lock for all facet operations(*) The numbers from the first test run are already available in the incident. The numbers for the second test run were a bit bigger, which is expected because of the additional synchronization. E.g., I have found that the test, run against the STDCXX locale database, gives: $ time ./t en_US.UTF-8 4 500 4, 500 real0m19.210s user0m32.659s sys 0m36.507s where the first patch gives: $ time ./t en_US.UTF-8 4 5000; done 4, 5000 real0m7.961s user0m31.211s sys 0m0.003s Notice the number of loops, 10x larger in the second set of number. Now, for the third patch, the numbers are embarrassingly large. It practically slows the program to a crawl because it does not do any reads of facet data without holding the lock. This, I wager, eliminates all races in the numpunct/moneypunct facet. See: $ time ./t en_US.UTF-8 4 10 4, 10 real0m24.316s user0m28.213s sys 0m6.633s Now that is a whooping performance hit. I went back to see an explanation between the numbers I see with Stefan's more performing patch, and my lock-all patch. I believe that there are still places in Stefan's patch where the facet makes unguarded reads of facet data, e.g.: loc/_numpunct.h: 190 // returns a pointer to the facet's implementation data 191 // if it exists, 0 otherwise 192 const void* _C_data () const { 193 return _C_impsize ? _C_impdata 194 : _RWSTD_CONST_CAST (__rw_facet*, this)-_C_get_data (); 195 } where _C_impsize/_C_impdata are read outside of a lock scope, etc. I.e., a thread may read the _C_impsize while, say, another is still writing the _C_impdata member variable. I conclude, even with such simple testing, that any solution that attempts to completely eliminate the race conditions in these facets will incur a penalty, with the safest (top-level locking) the worst by orders of magnitude and any other solution somewhere in between. Please let me know if you have any questions. Liviu (*) Will attach it as soon as we figure out how to undo the closing of the issue.
Re: [PATCH] STDCXX-853
On 09/24/12 01:18, Travis Vitek wrote: Liviu, Should the volatile be to the left of the intT typename here? I know it is equivalent, but it is weird to look at the line of code below and see that we're following two different conventions. Thanks, will do. Travis ___ From: Liviu Nicoara Sent: Sunday, September 23, 2012 8:34 AM To: dev@stdcxx.apache.org Subject: [PATCH] STDCXX-853 Umm, I didn't think to search for a corresponding incident and I considered the defect to be so minor as to not warrant an issue. The following patch has been applied already on 4.2.x: Index: tests/support/atomic_xchg.cpp === --- tests/support/atomic_xchg.cpp (revision 1388732) +++ tests/support/atomic_xchg.cpp (revision 1388733) @@ -297,7 +297,7 @@ void run_test (intT, thr_args_base::tag_ // compute the expected result, skipping zeros by incrementing // expect twice when it overflows and wraps around to 0 (zero is // used as the lock variable in thread_routine() above) -intT expect = intT (1); +intT volatile expect = intT (1); const unsigned long nincr = (Args::nthreads_ * Args::nincr_) / 2U; -- And now I see with eye serene The very pulse of the machine.
Re: STDCXX-1056 : numpunct fix
On 9/24/12 12:06 AM, Stefan Teleman wrote: On Fri, Sep 21, 2012 at 9:10 AM, Liviu Nicoara nikko...@hates.ms wrote: On 09/21/12 05:13, Stefan Teleman wrote: On Fri, Sep 21, 2012 at 2:28 AM, Travis Vitek travis.vi...@roguewave.com wrote: I have provided this list with test results showing that my patch *does* fix the race condition problems identified by all the tools at my disposal. I'm willing to bet you never looked at it. You dismissed it outright, just because you didn't like the *idea* that increasing the size of the cache, and eliminating that useless cache resizing would play an important role in eliminating the race conditions. I looked at it in great detail and I am sure Travis read it too. The facet/locale buffer management is a red herring. Apparently, we cannot convince you, regardless of the argument. That's fine by me. This bug [STDCXX-1056] was updated over the weekend with new comments. Here's the link to the comments, for the record: Stefan, was it your intention to completely eliminate all the race conditions with this last patch? Is this what the tools showed in your environment? Thanks, Liviu https://issues.apache.org/jira/browse/STDCXX-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13461452#comment-13461452
Re: STDCXX-1066 [was: Re: STDCXX forks]
On 09/16/12 12:03, Stefan Teleman wrote: On Sat, Sep 15, 2012 at 5:47 PM, Liviu Nicoara nikko...@hates.ms wrote: I merely wanted to point out that restoring the default packing is not the same as restoring the packing to the previous value in effect. Given this, I thought about an alternative way of forcing this alignment, e.g., via a union, aligned on an appropriate type. I see an advantage here in that most of the changes would occur where we define the 'primitive' mutex and condition wrappers, saving a few pre-processor conditionals and pragmas along the way. I understood what you were saying. I just don't understand under what _sane_ circumstances a program would #include a system library header file with a previously set packing to something other than default. Or would expect a non-default packing to work during a function call to a sytem library. That's an insane configuration on any operating system that I know of, not just on SPARCV8. I can't think of a valid scenario, either. I guess the point is moot. A few more questions, if you will, as I am going through the changes: 1. I see similarities with 1040, should/would you close that one? 2. The issue only exists in MT builds, should there be a guard in configs? 3. The align reference docs talk only about aligning variables, not types. Is that different on SPARC? 4. I see rw/_mutex.h has alignment pragmas for both __rw_mutex_base class and its mutex member; same for __rw_static_mutex and its static member, etc. How does that work? 5. Why is __rw_guard aligned explicitly? I see it only contains a pointer to a mutex object. 6. The docs mention that the pragma must use the mangled variables names but I don't see that in the patch. Thanks! Liviu
[PATCH] STDCXX-853
Umm, I didn't think to search for a corresponding incident and I considered the defect to be so minor as to not warrant an issue. The following patch has been applied already on 4.2.x: Index: tests/support/atomic_xchg.cpp === --- tests/support/atomic_xchg.cpp (revision 1388732) +++ tests/support/atomic_xchg.cpp (revision 1388733) @@ -297,7 +297,7 @@ void run_test (intT, thr_args_base::tag_ // compute the expected result, skipping zeros by incrementing // expect twice when it overflows and wraps around to 0 (zero is // used as the lock variable in thread_routine() above) -intT expect = intT (1); +intT volatile expect = intT (1); const unsigned long nincr = (Args::nthreads_ * Args::nincr_) / 2U;
[PATCH] STDCXX-1069 [was: Re: SUNPro 5.12 compilation failure in optimized Linux builds]
On 09/22/12 00:51, Liviu Nicoara wrote: Optimized (but not debug) builds fail to compile setlocale.cpp with the error: A patch and a comment have been attached to the issue. Thanks, Liviu
Re: STDCXX-1066 [was: Re: STDCXX forks]
On 9/23/12 3:48 PM, Stefan Teleman wrote: On Sun, Sep 23, 2012 at 3:26 PM, Liviu Nicoara nikko...@hates.ms wrote: To be honest it's quite bizarre that you cannot share that with us. Is it really a trade secret? How can that be the case if Oracle customers are also required to perform the same alignment, perhaps using the same techniques like you showed in the patch? That's the problem. I don't know what is and what is not a trade secret, or copyrighted information, or unauthorized intellectual property disclosure anymore. I think it's in the eye of the beholder. At Sun it was very clear. Believe it or not, I had to get written approval from the Legal Counsel's Office in order to be able to share these patches. And that in spite of the fact that these patches are published, and have already been published for *years*. That clarifies things. Thanks. Liviu
Re: STDCXX-1066 [was: Re: STDCXX forks]
On 9/23/12 5:50 PM, Stefan Teleman wrote: On Sun, Sep 23, 2012 at 5:23 PM, Stefan Teleman stefan.tele...@gmail.com wrote: The second URL says this: QUOTE Due to a change in the implementation of the userland mutexes introduced by CR 6296770 in KU 137111-01, objects of type mutex_t and pthread_mutex_t must start at 8-byte aligned addresses. If this requirement is not satisfied, all non-compliant applications on Solaris/SPARC may fail with the signal SEGV with a callstack similar to the following one or with similar callstacks containing the function mutex_trylock_process. \*_atomic_cas_64(0x141f2c, 0x0, 0xff00, 0x1651, 0xff00, 0x466d90) set_lock_byte64(0x0, 0x1651, 0xff00, 0x0, 0xfec82a00, 0x0) fast_process_lock(0x141f24, 0x0, 0x1, 0x1, 0x0, 0xfeae5780) /QUOTE Here's a link to an official datatype alignment table for SPARCV8: http://docs.oracle.com/cd/E19205-01/819-5267/bkbkl/index.html The interesting table is: Table B–2 Storage Sizes and Default Alignments in Bytes I see nothing really outstanding here. What is it that I should pay attention to? Thanks, Liviu
STDCXX-1056 DCII [was: Re: STDCXX-1056 [was: Re: STDCXX forks]]
On 9/17/12 5:42 PM, Liviu Nicoara wrote: [...] However, after more careful thought, I think there is a problem there even though we don't have an objective proof for it, yet. The writes are not atomic and they function just like DCII, being subject to both compiler reordering and out of order execution. E.g., it is assumed that the writes to the _C_impsize and _C_impdata members occur in the program order, which would make unguarded reads/tests of _C_impsize safe. This is what the facet initialization and access code does, conceptually: if (_C_impsize) { // already initialized, use data } else { // initialize mutex.lock (); _C_impdata = get_data_from_database (); _C_impsize = get_data_size (); mutex.unlock (); } The mutex lock and unlock operations introduce memory barriers but they are not guaranteeing the order in which the two writes occur. If the writes would occur in the exact program order, unguarded reads and tests of _C_impsize would be safe. Unfortunately, a thread may see the _C_impsize variable being updated before the _C_impdata. Elaborating on the above -- facet.cpp:~250, the code is actually closer to this: if (_C_impsize) { // already initialized, use _C_impdata } else { // initialize mutex.lock (); if (_C_impsize) return _C_impdata; void* const pv = ...; size_t sz = ...; _C_impdata = pv; // 1 _C_impsize = sz; // 2 mutex.unlock (); } (1) and (2) can be reordered. Also, when mapping the STDCXX locale database, facet.cpp:288, there is another defect writing the two variables: _C_impdata = __rw_get_facet_data (cat, _C_impsize, 0, codeset); ^^^ ||| Set early in __rw_get_facet_data -+++ The only way to order the above writes, preserve the fast checking for initialization, and generally salvaging the current algorithm is a full memory barrier in between the writes, e.g.: if (_C_impsize) { // already initialized, use data } else { // ... _C_impdata = pv; full_membar (); _C_impsize = sz; // ... } Atomics are part of C++11, I wonder if anybody here has working experience with the atomic API. A mutex locking will not do because the lock is only acquiring and the write before the lock can shift down. Of course I could be wrong. Any input is welcome. Thanks, Liviu
STDCXX-1068 and alignment
Stefan, could you please elaborate on the misaligned reads/writes that you observed on those platforms? What was failing? Thanks, Liviu
Re: STDCXX-1056 : numpunct fix
On 09/21/12 05:13, Stefan Teleman wrote: On Fri, Sep 21, 2012 at 2:28 AM, Travis Vitek travis.vi...@roguewave.com wrote: I have provided this list with test results showing that my patch *does* fix the race condition problems identified by all the tools at my disposal. I'm willing to bet you never looked at it. You dismissed it outright, just because you didn't like the *idea* that increasing the size of the cache, and eliminating that useless cache resizing would play an important role in eliminating the race conditions. I looked at it in great detail and I am sure Travis read it too. The facet/locale buffer management is a red herring. Apparently, we cannot convince you, regardless of the argument. That's fine by me. I have yet to see an alternative patch being proposed here, which would eliminate the race conditions, and which I am willing to test just as thoroughly as I've tested mine. The only thing i've seen are continued attempts at dismissing the existence of these race conditions, as reported by the instrumentation tools, based on some general axioms about their accuracy. No-one on this list has a clue as to how the SunPro Thread Analyzer actually works, because it's not open source, and none of you work at Oracle, therefore you can't see the code. But everyone just *knows* that it's inaccurate, or that it reports false positives. We are not dismissing the analysis. We are dismissing your interpretation of it. The tool is not indicating a defect, but a potential defect. I am calling it potential because, as my test case (http://tinyurl.com/97ks9gc) indicates, one can have a race condition which is not a defect in a concurrent environment. As long as you, or anyone else, continues to blame the instrumentation tools for the bug, and as long as anyone here continues to suggest that the only acceptable proof of the existence of this bug is some other program which needs to be written using fprintf(3C), and as long as no-one here is willing to provide an alternative patch which demonstrably eliminates 100% of the reported race conditions, this entire back-and-forth about the existence of these race conditions, the accuracy of the tools and what they are reporting is nothing but a giant waste of time. The correction of 1056 does not necessarily need to eliminate the analyzer warnings. Aiming for that is simply unreasonable, not because it is not attainable (after all, you demonstrated it), but because it is inappropriately and insufficiently argued for the situation at hand. Liviu
SUNPro 5.12 compilation failure in optimized Linux builds
Optimized (but not debug) builds fail to compile setlocale.cpp with the error: $ cat t.cpp; CC -c t.cpp #define _XOPEN_SOURCE #include cwchar /opt/sunpro/12_3/prod/include/cc/wchar.h, line 90: Error: tm is not defined. /opt/sunpro/12_3/prod/include/cc/wchar.h, line 92: Error: fgetwc is not defined. ... and so on for every wide-char function in wchar.h. The definition of _XOPEN_SOURCE at the top of setlocale.cpp is not entirely correct in that it interacts with the various guards in the C library headers. AFAICT, Oracle/Sun wchar.h includes C library wchar.h, which includes C library wctype.h, which includes again wchar.h, coming full circle to the compiler wchar.h which uses the names w/o them being defined, yet. The fact that Oracle/Sun headers include_next corresponding C library headers outside of a inclusion guard does not help. (Debug builds are not affected because of interaction from the difference in the structure and includes of rw/_traits.h in debug vs. optimized builds). The poor man's fix is to guard that _XOPEN_SOURCE define for SUNPro builds (see below). However, I am not sure whose side the error is. Thanks, Liviu PS. Is it still called SUNPro? Oracles seems it has sanitized that name out on their website. Index: src/setlocale.cpp === --- src/setlocale.cpp (revision 1388733) +++ src/setlocale.cpp (working copy) @@ -34,8 +34,10 @@ #include rw/_defs.h #if defined (__linux__) !defined (_XOPEN_SOURCE) +# if !defined (__SUNPRO_CC) // need S_IFDIR on Linux # define _XOPEN_SOURCE +# endif // !__SUNPRO_CC #endif // __linux__ !_XOPEN_SOURCE #include locale.h // for setlocale()
Re: STDCXX-1056 : numpunct fix
Thanks for the feed-back. Please see below. On Sep 19, 2012, at 10:02 PM, Stefan Teleman wrote: On Wed, Sep 19, 2012 at 8:51 PM, Liviu Nicoara nikko...@hates.ms wrote: I think you are referring to `live' cache objects and the code which specifically adjusts the size of the buffer according to the number of `live' locales and/or facets in it. In that respect I would not call that eviction because locales and facets with non-zero reference counters are never evicted. But anyhoo, this is semantics. Bottom line is the locale/facet buffer management code follows a principle of economy. Yes it does. But we have to choose between economy and efficiency. To clarify: The overhead of having unused pointers in the cache is sizeof(void*) times the number of unused slots. This is 2012. Even an entry-level Android cell phone comes with 1GB system memory. If we want to talk about embedded systems, where memory constraints are more stringent than cell phones, then we're not talking about Apache stdcxx anymore, or any other open souce of the C++ Standard Library. These types of systems use C++ for embedded systems, which is a different animal altogether: no exceptions support, no rtti. For example see, Green Hills: http://www.ghs.com/ec++.html. And even they have become more relaxed about memory constraints. They use BOOST. Bottom line: so what if 16 pointers in this 32 pointer slots cache never get used. The maximum amount of wasted memory for these 16 pointers is 128 bytes, on a 64-bit machine with 8-byte sized pointers. Can we live with that in 2012, a year when a $500 laptop comes with 4GB RAM out of the box? I would pick 128 bytes of allocated but unused memory over random and entirely avoidable memory churn any day. The argument is plausible and fine as far as brainstorming goes. But have you measured the amount of memory consumed by all STDCXX locale data loaded in one process? How much absolute time is spent in resizing the locale and facet buffers? What is the gain in space and time performance with such a change versus without? Just how fragmented the heap becomes and is there a performance impact because of it, etc.? IOW, before changing the status quo one must show an objective defect, produce a body of evidence, including a failing test case for the argument. My goal: I would be very happy if any application using Apache stdcxx would reach its peak instantiation level of localization (read: max number of locales and facets instantiated and cached, for the application's particular use case), and would then stabilize at that level *without* having to resize and re-sort the cache, *ever*. That is a locale cache I can love. I love binary searches on sorted containers. Wrecking the container with insertions or deletions, and then having to re-sort it again, not so much. Especially when I can't figure out why we're doing it in the first place. And I love minimalistic code, and hate waste at the same time, especially in a general purpose library. To each its own. Hey Stefan, are the above also timing the changes? Nah, I didn't bother with the timings - yet - for a very simple reason: in order to use instrumentation, both with SunPro and with Intel compilers, optimization of any kind must be disabled. On SunPro you have to pass -xkeepframe=%all (which disables tail-call optimization as well), in addition to passing -xO0 and -g. So the timings for these unoptimized experiments would have been completely irrelevant. Well, I think you are the only one around here with access to SPARC hardware, your input is very precious in this sense. Also, this is the reason for which I kept asking that question earlier: do we have currently any failing locale MT test when numpunct does just perfect forwarding, with no caching? I.e., changing just _numpunct.h and no other source file (as to silence thread analyzers warnings) does any locale (or other) MT tests fail? I would greatly appreciate it if you could give it a run on your hardware if you don't already know the answer. The discussion has been productive. But I object to the patch as is because it goes out of the scope of the original incident. I think this patch should only touch the MT defect detected by the failing test cases. If you think the other parts you changed are defects you should open corresponding issues in JIRA and have them discussed in their separate rooms. Thanks, Liviu
Re: STDCXX-1056 : numpunct fix
On 09/20/12 13:11, Stefan Teleman wrote: On Thu, Sep 20, 2012 at 8:07 AM, Liviu Nicoara nikko...@hates.ms wrote: But have you measured the amount of memory consumed by all STDCXX locale data loaded in one process? How much absolute time is spent in resizing the locale and facet buffers? What is the gain in space and time performance with such a change versus without? Just how fragmented the heap becomes and is there a performance impact because of it, etc.? IOW, before changing the status quo one must show an objective defect, produce a body of evidence, including a failing test case for the argument. sizeof(std::locale) * number_of_locales. I was more interested in the underlying locale data, not the C++ objects. I'll let you in on a little secret: once you call setlocale(3C) and localeconv(3C), the Standard C Library doesn't release its own locale handles until process termination. So you might think you save a lot of memory by destroying and constructing the same locales. You're really not. It's the Standard C Library locale data which takes up a lot of space. Thanks for the secret, I appreciate it. Did you mean to say that the C Standard mandates that?! What I do know for a fact that this optimization did, was to cause the races conditions reported by 4 different thread analyzers. Race conditions are a show-stopper for me, and they are not negotiable. No, that optimization was not causing the MT defect you originally noted. Saying so only shows a lack of familiarity with the implementation. And I love minimalistic code, and hate waste at the same time, especially in a general purpose library. To each its own. Here's a helpful quote: We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. It's from Donald Knuth. Please, no. And I love correct code which doesn't cause thread analyzers to report more than 12000 race conditions for just one test case. I've said it before and I will say it again: race conditions are a showstopper and are not negotiable. Period. Stefan, you are not being correct by a consensus of thread analyzers output or by being loud, or by pounding your fist on the table. This being said I will continue to exercise due diligence and put in the necessary time and effort to provide you with the most useful feed-back I can. I see that you missed my question in the post before: did you see a failure in the locale MT tests in your SPARC runs? If you do not want to run that test, that is fine, just let me know. Thanks, Liviu
Re: STDCXX-1056 : numpunct fix
On Sep 20, 2012, at 5:31 PM, Stefan Teleman wrote: On Thu, Sep 20, 2012 at 5:07 PM, Liviu Nicoara nikko...@hates.ms wrote: To answer your question [...]: yes, the MT failures occur on SPARC as well, on both SPARCV8 and SPARCV9, and the race conditions are reported on *ALL* plaforms tested, even after having applied your _numpunct.h patch. This patch alone does *NOT* solve the problem. Stefan, I want to be clear. You are talking about a patch identical in nature to the one I have attached now. Just want to be 100% sure we are talking about the same thing. This one still produces failures (crashes, assertions, etc.) in the locale MT tests on SPARC and elsewhere in your builds? Thanks, Liviu
Re: STDCXX-1056 : numpunct fix
On Sep 20, 2012, at 7:37 PM, Stefan Teleman wrote: On Thu, Sep 20, 2012 at 7:34 PM, Wojciech Meyer wojciech.me...@googlemail.com wrote: Hi, My perceptions is by reading through the whole thread - we should not trust 100% external tools to asses the safety of the code. I don't think there exist an algorithm that produces no false positives. That's said I admire Stefan's approach, but we should ask the question are we MT safe enough? I would say from what I read here: yes. Based on what objective metric? The only gold currency that anyone in here accepts without reservations are failing test cases. I believe I have seen some exceptions to the golden rule in my RW time, but I can't recall any specific instance. Liviu
Re: STDCXX-1056 : numpunct fix
On Sep 20, 2012, at 5:23 PM, Stefan Teleman wrote: On Thu, Sep 20, 2012 at 4:45 PM, Travis Vitek travis.vi...@roguewave.com wrote: I'll let you in on a little secret: once you call setlocale(3C) and localeconv(3C), the Standard C Library doesn't release its own locale handles until process termination. So you might think you save a lot of memory by destroying and constructing the same locales. You're really not. It's the Standard C Library locale data which takes up a lot of space. You have a working knowledge of all Standard C Library implementations? I happen to do, yes, for the operating systems that I've been testing on. I also happen to know that you don't. This fact alone pretty much closes up *this* particular discussion. Do yourself, and this mailing list a favor: either write a patch which addresses all of your concerns *AND* eliminates all the race conditions reported, or stop this pseudo software engineering bullshit via email. There is apparently, a high concentration of know-it-alls on this mailing list, who are much better at detecting race conditions and thread unsafety than the tools themselves. Too bad they aren't as good at figuring out their own bugs. The sniping is uncalled for. There are no enemies here, no one is after you. There is criticism though and you are expected to take it and argue your point of view. If you can't stand the heat, get out of the kitchen. It took eight months for anyone here to even *acknowledge* that numpunct and moneypunct do have, in fact, a thread safety problem. Never mind that the test case for these facets had been crashing for 4 years. To be quite blunt and to the point, after 8 months of denying obvious facts, your credibility is quite a bit under question at this point. Yes, we are busy with other stuff. I wish I got paid to work on this instead. This entire discussion has become a perfect illustration with what's wrong with the ASF, as reported here: http://www.mikealrogers.com/posts/apache-considered-harmful.html I actually read it. I see a guy complaining he can't have it his way. No problem. One can fork this project at any time and start it anew, by themselves, or in the company of like programmers elsewhere. For better or worse Apache got STDCXX from RogueWave. Complaining about it is like complaining that Apple doesn't give us iPhones for free; after all we are the power users and we know what to do with them. L
Re: STDCXX-1056 : numpunct fix
On Sep 20, 2012, at 7:45 PM, Stefan Teleman wrote: On Thu, Sep 20, 2012 at 7:22 PM, Liviu Nicoara nikko...@hates.ms wrote: Stefan, I want to be clear. You are talking about a patch identical in nature to the one I have attached now. Just want to be 100% sure we are talking about the same thing. This one still produces failures (crashes, assertions, etc.) in the locale MT tests on SPARC and elsewhere in your builds? On September 17, 2012 I have posted the following message to this list: http://www.mail-archive.com/dev@stdcxx.apache.org/msg01929.html In that message, there is a link to my SPARC thread-safety test results: http://s247136804.onlinehome.us/stdcxx-1056-SPARC-20120917/22.locale.numpunct.mt.nts.1.er.html/index.html This test was run with the following _numpunct.h file: http://s247136804.onlinehome.us/stdcxx-1056-SPARC-20120917/22.locale.numpunct.mt.nts.1.er.html/file.14.src.txt.html The test above shows 12440 race conditions detected for a test run of 22.locale.numpunct.mt, with --nthreads=8 --nloops=1. Did you ever look at these test results? From reading your email, I realize that you never looked at it. That is the only possible explanation as to why you're asking now for SPARC test results, today being September 20, 2012. I see, there is a confusion about this. Probably nobody explained it before. A failing test case means a test case that causes the abnormal termination of the execution of the program or creates evidence of abnormal data in the program execution. In this respect please see the atomic add and exchange tests as classical examples of what I mean. I have read all your emails in detail and I have inspected all your attachments, modulo the ones I could not open. Thanks, Liviu
Re: STDCXX-1056 : numpunct fix
On Sep 20, 2012, at 8:02 PM, Stefan Teleman wrote: On Thu, Sep 20, 2012 at 7:52 PM, Liviu Nicoara nikko...@hates.ms wrote: On Sep 20, 2012, at 7:49 PM, Stefan Teleman wrote: On Thu, Sep 20, 2012 at 7:40 PM, Liviu Nicoara nikko...@hates.ms wrote: The only gold currency that anyone in here accepts without reservations are failing test cases. I believe I have seen some exceptions to the golden rule in my RW time, but I can't recall any specific instance. That may be a valid metric here. The only one. Any programmer worth his salt -- I am borrowing your words here -- would be able to demonstrate the validity of his point of view with a test case. I did. There are 12440 race conditions detected for an incomplete run of 22.locale.numpunct.mt. By incomplete I mean: it did not run with its default nthreads and nloops which I believe are 8 threads and 20 loop iterations. That is not it, and you did not. Please pay attention: given your assertion that a race condition is a defect that causes an abnormal execution of the program during which the program sees abnormal, incorrect states (read: variable values) it should be easy for you to craft a program that shows evidence of that by either printing those values, or aborting upon detecting them, etc. [...] and overall just email bullshit. Stop using that word. L
Re: STDCXX-1056 : numpunct fix
On Sep 20, 2012, at 8:59 PM, Stefan Teleman wrote: On Thu, Sep 20, 2012 at 8:44 PM, C. Bergström cbergst...@pathscale.com wrote: fencepost comment - The results are based on tools and I don't think he has a large program which actually triggers the conditions. (Creating one may take quite some time) I do have a program which triggers the race conditions: 22.locale.numpunct.mt. Part of the official test harness. The real reason why they don't want to accept what the instrumentation tools are saying is very simple: they don't LIKE reading what the tools are saying. So, blame the tools, pretend that as long as it doesn't crash again there's no bug and hope for the best. I cannot include an analyzer output as a regression test in the test suite. But I am very glad this is on a public mailing list, so everyone can read what's going on here. ?
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 09/18/12 08:55, Stefan Teleman wrote: On Mon, Sep 17, 2012 at 11:17 AM, Liviu Nicoara nikko...@hates.ms wrote: I hope you agree that this synchronization is sufficient for the facet initialization and reading of facet data. I have reduced the number of reported race conditions in 22.locale.numpunct.mt from 12440: I am attaching a test program which, while 100% MT-safe, is flagged by the Solaris thread analyzer. The test program contains, conceptually, the exact same of shared variable access like the locale management code, the facet management code and the facet data management code. This proves, if there was a need, that the tool results are to be further analyzed and interpreted but not taken at literal value. The fix we are looking for is one which corrects the initial MT failure observed in caching the facet data, as well as it preserves the unguarded reads of facet data for performance reasons. Thanks. Liviu http://s247136804.onlinehome.us/stdcxx-1056-SPARC-20120917/22.locale.numpunct.mt.nts.1.er.html/index.html to 288: http://s247136804.onlinehome.us/stdcxx-1056-20120918/22.locale.numpunct.mt.5.er.html/index.html The changes are in the following files: http://s247136804.onlinehome.us/stdcxx-1056-20120918/facet.cpp http://s247136804.onlinehome.us/stdcxx-1056-20120918/punct.cpp _numpunct.h looks like this: http://s247136804.onlinehome.us/stdcxx-1056-20120918/_numpunct.h With these changes, no races conditions are repoted for any of the functions in std::numpunctT. Still, there are 288 race conditions being reported in __rw_locale::__rw_locale and in std::locale::_C_get_facet. We need to identify the source and cause of these race conditions and correct them as well. This is not a complete solution to the problem, because we still have to re-write the chunk of code I eliminated from facet.cpp. It is only step one towards finding a real solution. But, at least for now, we have pinpointed where the source of these race conditions is located, and what causing it. The test program was run as: ./22.locale.numpunct.mt --nthreads=8 --nloops=1. More to follow. --Stefan #include stdio.h #include stdlib.h #include pthread.h #include unistd.h #define MAX_THREADS 128 static long counter = 0, nloops = 1000, nthreads = 16; static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; extern C { static void* f (void* pv) { for (size_t i = 0; i nloops; ++i) { if (counter == 0) { pthread_mutex_lock (lock); if (counter == 0) ++counter; pthread_mutex_unlock (lock); } else { // counter value is safe to use here } } printf (%ld\n, counter); return 0; } } int main (int argc, char** argv) { switch (argc) { case 3: nloops = atol (argv [2]); case 2: nthreads = atol (argv [1]); break; } pthread_t tid [MAX_THREADS] = { 0 }; if (nthreads MAX_THREADS) nthreads = MAX_THREADS; for (int i = 0; i nthreads; ++i) { if (pthread_create (tid + i, 0, f, 0)) exit (-1); } for (int i = 0; i nthreads; ++i) { if (tid [i]) pthread_join (tid [i], 0); } return 0; }
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 09/18/12 13:21, Stefan Teleman wrote: On Tue, Sep 18, 2012 at 12:43 PM, Liviu Nicoara nikko...@hates.ms wrote: I am attaching a test program which, while 100% MT-safe, is flagged by the Solaris thread analyzer. The program as written is not thread safe. It is reading the value of the counter variable and performing a zero comparison outside of a mutex lock: Stefan, I urge you to consider the argument on its merits. Yes, the thread analyzer flags it, but it is nonetheless MT-safe. Specifically: 1. writes are properly synchronized wrt themselves 2. reads are inherently thread-safe wrt themselves 3. reads are properly synchronized wrt writes 4.no thread can possibly observe an intermediate or otherwise incomplete value. I will also add that the flag is either 0 or 1 during the execution of the program, with only one transition from 0 to 1, performed by one single thread. I will concede that I might be wrong and I am open to arguments. I would accept as a counter-argument this program if you can show a runtime failure. I would also accept as argument a scenario under which two threads would see inconsistent/incorrect values or write the variable more than once, etc. Thanks, Liviu for (size_t i = 0; i nloops; ++i) { if (counter == 0) { // --- pthread_mutex_lock (lock); if (counter == 0) ++counter; pthread_mutex_unlock (lock); } else { // counter value is safe to use here } }
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 09/18/12 18:24, Stefan Teleman wrote: On Tue, Sep 18, 2012 at 4:35 PM, Liviu Nicoara nikko...@hates.ms wrote: I will concede that I might be wrong and I am open to arguments. I would accept as a counter-argument this program if you can show a runtime failure. The the first read of the counter variable is outside a mutex lock correct? The read is followed by a 0 comparison, correct? Correct. What guarantees that between the read and the comparison the value of the counter variable hasn't been modified by another thread? The thread currently doing the comparison cannot guarantee it: it hasn't locked the mutex. Other threads may be running - actually they probably are. Another thread may have already acquired the mutex and incremented the value of counter. Your thread has no way of knowing if that has happened, because it does not yet have exclusive access to the counter variable. It will, after it acquires the mutex. It does not matter if the value is changed concurrently in between the reading of it and the actual comparison. The value can only be 0, in which case it takes the branch that locks the mutex and the next read will absolutely see a fresh value; or it can be 1 in which case it has already been initialized and this thread does not pay the penalty of a lock anymore. Where is it reading the variable from? A register? Is it declared volatile? L2 cache? L3 cache? Irrelevant. If the value has not been changed it can be read from either the cache or directly from the memory. If the value has been changed concurrently in between the reading and the actual comparison, the processor sees either a stale value of 0, or the new value 1. It is guaranteed that any thread reading the value after the (one and only) unlock will see a fresh value. The program, as you wrote it, implicitly acknowledges that it is not thread safe. That is the point of the double check: one before the mutex lock, and one after it. That is a misstatement of the program intentions. See below. The point of the first check has nothing to do with thread-safety, and everything to do with a minor optimization: if the value stored in variable counter is already not zero, then there's no point in locking the mutex or performing the increment. The first check is indeed an optimization. It is the point of this exercise to show that the unguarded reads in the localization library are not defects and the code, simplified in my test case, is thread safe in exactly the respects I mentioned before: the program only observes consistent, correct values (program states) in a concurrent environment. HTH. Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 9/18/12 7:04 PM, Stefan Teleman wrote: On Tue, Sep 18, 2012 at 6:42 PM, Liviu Nicoara nikko...@hates.ms wrote: The first check is indeed an optimization. It is the point of this exercise to show that the unguarded reads in the localization library are not defects and the code, simplified in my test case, is thread safe in exactly the respects I mentioned before: the program only observes consistent, correct values (program states) in a concurrent environment. The Intel Thread Analyzer shows a data race in the exact same spot the SunPro Thread Analyzer shows a data race, and in the exact same spot I have stated earlier that there is a data race. The results are available here: http://s247136804.onlinehome.us/liviu-mt-test/22.mt.test.r000ti3.inspxez As usual, it is an Intel Inspector XE 2003 Analysis Results Blob. It opens with the Intel Inspector. i don't know what format it is in, or whether or not it can be read by other means. So far we have SunPro 12.3/Linux/Intel, SunPro 12.3/Solaris/Intel, SunPro 12.3/Solaris/SPARC and Intel 2003/Linux/Intel saying that there is a race condition in your program. And now I would like to get back to solving the thread-unsafety and race conditions problems in Apache stdcxx. Stefan, you cannot ignore others' arguments and embrace a 'no' attitude; the tools are not saying what you think they do. They all show the same thing because they all do the exact same analysis and we could delve into that, too. I sincerely hoped you could come up with some sort of a logical argument that would prove my latest test case wrong, or at least show it failing at runtime. My time is also limited and I have already spent more time than I wanted to trying to present a logical argument, down to the simplest test cases. I say we defer this to Martin for when he will be back from vacation, next week. For the record, because people may not have the energy to read all this back and forth, I will summarize here the gist of this thread: 1. The facet data caching is not MT-safe 2. The facet data initialization (STDCXX or system locales) is safe (*) 3. There is no unit test currently showing a failure in (2) 4. Timing results show that caching may be slower than non-caching in MT builds 5. A fix should, ideally, be binary compatible 6. A fix should, ideally, preserve performance or increase it 7. There is one patch, currently attached to the issue, by Stefan 8. Other partial patches are referenced from this thread Please correct me if I missed anything. The above summary is a good starting point for a new thread dealing with this issue. HTH. Liviu (*) When perfectly forwarding, with a theoretical concern I asked in an earlier post, directed at Martin.
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 09/17/12 09:51, Stefan Teleman wrote: On Mon, Sep 17, 2012 at 8:46 AM, Liviu Nicoara nikko...@hates.ms wrote: In the meantime I would like to stress again that __rw_get_numpunct is perfectly thread-safe and does not need extra locking for perfect forwarding. So, by removing the test for if (!(_C_flags _RW::__rw_gr)) (or any other bitmask for that matter), the functions which were thread-unsafe - and were exhibiting all the symptoms of a run-time race condition -, magically became thread-safe? I have looked *extensively* at the code in __rw_get_numpunct. It is inherently thread-unsafe. I mean to say that no extra locking is necessary when the public interface forwards and no caching is done. In more detail: __rw_get_numpunct code is entered upon a call from the protected virtual interface. It calls facet _C_data member function which either returns objects constructed from previously-initialized POD data (in which case no locking is necessary), or it attempts to first initialize the facet data from the STDCXX database. If the latter, the execution goes in the facet data initialization code which is appropriately synchronized, see this stack trace: (gdb) bt #0 __rw::__rw_facet::_C_get_data (this=0x6e10a0) at srcdir/stdcxx/branches/4.2.x/src/facet.cpp:179 #1 0x0043788b in __rw::__rw_facet::_C_data (this=0x6e10a0) at srcdir/stdcxx/branches/4.2.x/include/loc/_facet.h:194 #2 0x0044874f in __rw::__rw_get_numpunct (pfacet=0x6e10a0, flags=4) at srcdir/stdcxx/branches/4.2.x/src/punct.cpp:80 #3 0x00449a14 in __rw::__rw_get_punct (pfacet=0x6e10a0, flags=4) at srcdir/stdcxx/branches/4.2.x/src/punct.cpp:578 #4 0x0041cecf in std::numpunctchar::do_grouping (this=0x6e10a0) at srcdir/stdcxx/branches/4.2.x/include/loc/_numpunct.h:99 #5 0x0041bd31 in std::numpunctchar::grouping (this=0x6e10a0) at srcdir/stdcxx/branches/4.2.x/include/loc/_numpunct.h:190 #6 0x004036b8 in main (argc=2, argv=0x7fffe018) at srcdir/stdcxx/branches/4.2.x/tests/localization/s.cpp:29 (gdb) When the above fails, the facet data has not been initialized, e.g., when there is no STDCXX locale database, or the name of the locale does not refer to a locale in STDCXX database. The __rw_get_numpunct function will attempt next to use libc and system locales, via the __rw_setlocale class, which again is properly synchronized and the scope of that lock extends to cover the facet data initialization. See this stack trace: (gdb) bt #0 __rw::__rw_setlocale::__rw_setlocale (this=0x7fffdd30, locname=0x6e112a en_US.utf8, cat=6, nothrow=0) at srcdir/stdcxx/branches/4.2.x/src/setlocale.cpp:84 #1 0x00448974 in __rw::__rw_get_numpunct (pfacet=0x6e10a0, flags=4) at srcdir/stdcxx/branches/4.2.x/src/punct.cpp:134 #2 0x00449a14 in __rw::__rw_get_punct (pfacet=0x6e10a0, flags=4) at srcdir/stdcxx/branches/4.2.x/src/punct.cpp:578 #3 0x0041cecf in std::numpunctchar::do_grouping (this=0x6e10a0) at srcdir/stdcxx/branches/4.2.x/include/loc/_numpunct.h:99 #4 0x0041bd31 in std::numpunctchar::grouping (this=0x6e10a0) at srcdir/stdcxx/branches/4.2.x/include/loc/_numpunct.h:190 #5 0x004036b8 in main (argc=2, argv=0x7fffe018) at srcdir/stdcxx/branches/4.2.x/tests/localization/s.cpp:29 I hope you agree that this synchronization is sufficient for the facet initialization and reading of facet data. Thanks. Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 09/17/12 11:21, Stefan Teleman wrote: On Mon, Sep 17, 2012 at 11:17 AM, Liviu Nicoara nikko...@hates.ms wrote: I hope you agree that this synchronization is sufficient for the facet initialization and reading of facet data. Sorry, I do not agree. Two different thread analyzers from two different compilers written by two different compiler writers tell me not to. Stefan, that is indeed your prerogative. However, please keep in mind that tools may be buggy or may have limitations beyond what is advertised. I do not ask you to have faith in my analysis, which would be absurd, but to look for yourself, exercise due diligence in testing the code and draw your own conclusions. Thanks, Liviu
Sun fbe assembler manual
Hi all, I need a reference manual for the fbe assembler. I am interested in the syntax of the `.type' directive. Do you know where such a manual would be located? The Solaris assembler manual I found on Oracle's website does not mention the .type directive. Thanks! Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 9/16/12 3:20 AM, Stefan Teleman wrote: On Sat, Sep 15, 2012 at 4:53 PM, Liviu Nicoara nikko...@hates.ms wrote: Now, to clear the confusion I created: the timing numbers I posted in the attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012) showed that a perfectly forwarding, no caching public interface (exemplified by a changed grouping) performs better than the current implementation. It was that test case that I hoped you could time, perhaps on SPARC, in both MT and ST builds. The t.cpp program is for MT, s.cpp for ST. I got your patch, and have tested it. Thanks, Stefan. I looked over it and it seems very similar to, and somewhat more detailed than gprof profiling output. I am going to update the incident shortly with a more detailed timing measurements on my side, in the form of a new attachment. Just FYI in case you still don't get notifications. Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 9/16/12 11:21 AM, Liviu Nicoara wrote: On 9/16/12 3:20 AM, Stefan Teleman wrote: On Sat, Sep 15, 2012 at 4:53 PM, Liviu Nicoara nikko...@hates.ms wrote: Now, to clear the confusion I created: the timing numbers I posted in the attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012) showed that a perfectly forwarding, no caching public interface (exemplified by a changed grouping) performs better than the current implementation. It was that test case that I hoped you could time, perhaps on SPARC, in both MT and ST builds. The t.cpp program is for MT, s.cpp for ST. I got your patch, and have tested it. Thanks, Stefan. I looked over it and it seems very similar to, and somewhat more detailed than gprof profiling output. I am going to update the incident shortly with a more detailed timing measurements on my side, in the form of a new attachment. Just FYI in case you still don't get notifications. I have attached a new set of results to the incident, in the form of the archive: http://tinyurl.com/9drzg4e Please see the content for a description of the library changes (_numpunct.h file), the MT test program (t.cpp) and the results collected through two separate builds on two different machines (results.txt file). Thanks. Liviu
STDCXX-1066 [was: Re: STDCXX forks]
On 9/1/12 1:52 PM, Stefan Teleman wrote: On Sat, Sep 1, 2012 at 12:15 PM, I opened yesterday STDCXX-1066: https://issues.apache.org/jira/browse/STDCXX-1056 about the pthread_mutex_t/pthread_cond_t alignment on SPARCV8. I'll have patches done this weekend. Achtung: the patchset is very large and touches a very large number of files. It's strange that I didn't get an email about STDCXX-1066. Hi Stefan, I have read through the patches attached to the incident, then I briefly read about the SunPro pragma align and pack. Two questions: 1. AFAICT, the use of the packing pragma may interfere with a user's setting of the same value. I.e., a user sets the packing in their sources and then, directly or not, includes an STDCXX header. It seems to me that in such a situation, our setting of the packing value would interfere with the rest of the user's translation unit, since there is no way to `restore' the previous packing value. Something along the lines of: // user source file #pragma pack (X) // X != 8 #include iostream struct UserDef { // different alignment than X ? // ... }; Is my understanding correct? 2. The patches are against 4.2.1, but the change would be binary incompatible with the already released 4.2.1 branch. Do you plan to have this fix in 4.3.x? Thanks. Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 9/15/12 1:51 PM, Stefan Teleman wrote: On Sat, Sep 15, 2012 at 9:01 AM, Liviu Nicoara nikko...@hates.ms wrote: That is funny. What compiler are you using? What does the following test case return for you? It's the Intel compiler with the patched stdcxx for the wrong case and GCC 4.7.1 + GNU libstdc++ for the correct case. GCC + GNU libstdc++ are correct. The patched facet does not call the protected do_*() virtual functions from their public counterparts, as it is required to do by the Standard. Instead, it returns the data mebers directly (the data members were initialized in the constructor). That is the patch you proposed, which is indeed much better performing than using a mutex lock. Unfortunately, in doing so, overriding the virtual functions in a derived facet type becomes pointless. Ahh, I see now. You are indeed right, that patch is defective. I was under the impression that we were discussing the (later) attachment stdcxx-1056-timings.tgz which contains a perfectly forwarding implementation of the facet public grouping method. The timings I attached there were the ones I thought we were discussing all along. Now, to clear the confusion I created: the timing numbers I posted in the attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012) showed that a perfectly forwarding, no caching public interface (exemplified by a changed grouping) performs better than the current implementation. It was that test case that I hoped you could time, perhaps on SPARC, in both MT and ST builds. The t.cpp program is for MT, s.cpp for ST. Please let me know if this clarifies things. I apologize for the misunderstanding. Liviu
Re: STDCXX-1066 [was: Re: STDCXX forks]
On 9/15/12 2:57 PM, Stefan Teleman wrote: On Sat, Sep 15, 2012 at 1:02 PM, Liviu Nicoara nikko...@hates.ms wrote: I have read through the patches attached to the incident, then I briefly read about the SunPro pragma align and pack. Two questions: 1. AFAICT, the use of the packing pragma may interfere with a user's setting of the same value. I.e., a user sets the packing in their sources and then, directly or not, includes an STDCXX header. It seems to me that in such a situation, our setting of the packing value would interfere with the rest of the user's translation unit, since there is no way to `restore' the previous packing value. Something along the lines of: // user source file #pragma pack (X) // X != 8 #include iostream struct UserDef { // different alignment than X ? // ... }; Is my understanding correct? Yes, you are indeed correct, Sir. :-) However, for every #pragma pack(8) there should be a corresponding subsequent #pragma pack(0). If there isn't, that's a patch bug. #pragma pack(0) restores the default alignment. So, there should be (there *must* be) no silent packing side-effects in user programs. Yes, but it restores the default packing, not an arbitrary one, potentially set by the user prior to including our headers. Say, the user sets 2, the default is 4 and we set 8. When we set it to default it goes back to 4, instead of the expected 2. Did I get this right? 2. The patches are against 4.2.1, but the change would be binary incompatible with the already released 4.2.1 branch. Do you plan to have this fix in 4.3.x? Yes, definitely for 4.2.x and 4.3.x. I sent them for 4.2.1 just because that's what I have handy. I see now. I tried to apply them to 4.2.x and some chunks failed, and I thought I was not applying them where they were intended to go. Thanks! Liviu
Re: STDCXX-1066 [was: Re: STDCXX forks]
On 9/15/12 5:19 PM, Stefan Teleman wrote: On Sat, Sep 15, 2012 at 4:57 PM, Liviu Nicoara nikko...@hates.ms wrote: Yes, but it restores the default packing, not an arbitrary one, potentially set by the user prior to including our headers. Say, the user sets 2, the default is 4 and we set 8. When we set it to default it goes back to 4, instead of the expected 2. Did I get this right? This is true, but leaving some arbirary pragma pack(N) (for N != 0) in effect for the duration of a program, and expecting it to work sounds like a very defective programming approach to me. It will certainly not work on SPARC at all. if the program needs to pack something in a certain specific way, it need to do so for that specific something only. Otherwise the side-effects of globally setting a non-default packing will destroy the program anyway. I merely wanted to point out that restoring the default packing is not the same as restoring the packing to the previous value in effect. Given this, I thought about an alternative way of forcing this alignment, e.g., via a union, aligned on an appropriate type. I see an advantage here in that most of the changes would occur where we define the 'primitive' mutex and condition wrappers, saving a few pre-processor conditionals and pragmas along the way. What do you think? Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 09/12/12 00:21, Stefan Teleman wrote: On Tue, Sep 11, 2012 at 10:18 PM, Liviu Nicoara nikko...@hates.ms wrote: AFAICT, there are two cases to consider: 1. Using STDCXX locale database initializes the __rw_punct_t data in the first, properly synchronized pass through __rw_get_numpunct. All subsequent calls use the __rw_punct_t data to construct returned objects. 2. Using the C library locales does the same in the first pass, via setlocale and localeconv, but setlocale synchronization is via a per-process lock. The facet data, once initialized is used just like above. I probably missed this in the previous conversation, but did you detect a race condition in the tests if the facets are simply forwarding to the private virtual interface? I.e., did you detect that the facet initialization code is unsafe? I think the facet __rw_punct_t data is safely initialized in both cases, it's the caching that is done incorrectly. I originally thought so too, but now I'm having doubts. :-) And I haven't tracked it down with 100% accuracy yet. I saw today this comment in src/facet.cpp, line 358: // a per-process array of facet pointers sufficiently large // to hold (pointers to) all standard facets for 8 locales static __rw_facet* std_facet_buf [__rw_facet::_C_last_type * 8]; The localization code is cleverly dense and low-level. There are a few patterns that are used throughout, though. One is the function-local static buffers, e.g., the locales buffer and the facets buffer, in locale_body.cpp and facet.cpp, respectively. They both start from a reasonable size and are grown and shrunk according to the needs, but neither locales nor facets are ever evicted from them. See: locale_body:936 -- reduction of locale buffer locale_body:1006 -- expansion of locale buffer facet.cpp:397-- reduction of facet buffer facet.cpp:462-- expansion of facet buffer Facets buffer is guarded by a static guard in: __rw_facet::_C_manage:366 The locales buffer is protected by a static guard in: __rw_locale::_C_manage:880 [...] this is all investigative stuff for tomorrow. :-) and I agree with Martin that breaking ABI in a minor release is really not an option. I'm trying to find the best way of making these facets thread-safe while inflicting the least horrible performance hit. We are getting close to the point where we can summarize all these findings and gauge an appropriate course of action. I will run your tests tomorrow and let you know. :-) That would be very helpful! Liviu
Re: Board report time
On 09/11/12 08:15, Jim Jagielski wrote: It's time for our report to the board... what would we like to share? I see: o renewed discussion on health/viability of pmc o increased development being done o PMC expressing interest in moving to git This sounds about right. It should also mention that some members expressed interest in alternative licensing. Thanks. Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 9/11/12 9:40 PM, Martin Sebor wrote: On 09/11/2012 04:15 PM, Stefan Teleman wrote: On Mon, Sep 10, 2012 at 4:24 PM, Stefan Teleman I think I have something which doesn't break BC - stay tuned because I'm testing it now. OK. So, here's a possible implementation of __rw_get_numpunct() with minimal locking, which passes the MT tests and does not break ABI: s247136804.onlinehome.us/stdcxx-1056-20120911/punct.cpp And the same for include/loc/_numpunct.h: http://s247136804.onlinehome.us/stdcxx-1056-20120911/_numpunct.h In _numpunct.h, all the functions perform no checks and no lazy initialization. They function simply as a pass-through to __rw_get_numpunct(). std::numpunctT's data members are now dead varaiables. The bad: performance is no better than with locking the mutex inside each of the std::numpunctT::*() functions, and with lazy instantiation. I wouldn't expect this to be faster than the original. In fact, I would expect it to be slower because each call to one of the public, non-virtual members results in a call to the out-of-line virtual functions (and another to __rw_get_moneypunct). Avoiding the overhead of such calls is the main and only reason why the caching exists. AFAICT, there are two cases to consider: 1. Using STDCXX locale database initializes the __rw_punct_t data in the first, properly synchronized pass through __rw_get_numpunct. All subsequent calls use the __rw_punct_t data to construct returned objects. 2. Using the C library locales does the same in the first pass, via setlocale and localeconv, but setlocale synchronization is via a per-process lock. The facet data, once initialized is used just like above. I probably missed this in the previous conversation, but did you detect a race condition in the tests if the facets are simply forwarding to the private virtual interface? I.e., did you detect that the facet initialization code is unsafe? I think the facet __rw_punct_t data is safely initialized in both cases, it's the caching that is done incorrectly. I'm afraid unoptimized timings don't tell us much. Neither does a comparison between two compilers, even on the same OS. I looked at Liviu's timings today. I was puzzled by the difference between (1) which, IIUC, is the current implementation (presumably an optimized, thread-safe build with the same compiler and OS) and (4), which, again IIUC, is the equivalent of your latest patch here (again, presumably optimized, thread safe, same compiler/OS). I'm having trouble envisioning how calling a virtual function to retrieve the value of grouping can possibly be faster than not calling it (and simply returning the value cached in the data member of the facet. The new results I attached to the issue come from a a bit clearer tests and they focus on just two cases: the current implementation vs. a non-caching one; the latter just forwards the grouping calls to the protected do_grouping, with _no_ other changes to the implementation. The timing numbers seem to show that MT builds fare far worse with the caching than without. Stefan, if you have the time, could you please infirm :) my conclusions by timing it on one of your machines? Thanks, Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 09/10/12 15:01, Stefan Teleman wrote: On Mon, Sep 10, 2012 at 2:21 PM, Liviu Nicoara nikko...@hates.ms wrote: 4. Without caching of grouping values, grouping() delegates always to do_grouping(): real0m5.668s user1m11.389s sys 0m3.952s FWIW, 22.2.3.1.1 explicitly states that all of the decimal_point(), grouping(), truename(), falsename() must return their do_*() counterparts. Yes, of course. :-) I just meant to say that is _always_ delegating to do_grouping(), i.e., does no caching. Liviu --Stefan -- And now I see with eye serene The very pulse of the machine.
[PATCH] STDCXX-1067 Mac builds
The default compiler on recent Apple Macs is LLVM with Clang and gcc C++ front-ends. The compiler does not come with a C++ language support library. However, gcc Mac builds are fine with GNU stock compilers, modulo the issues for which I attach the patch below, for review. I built successfully and ran the test suite on both Mac and Linux, wigh gcc 4.5.4 and 4.5.2, respectively. Thanks. Liviu Index: src/x86_64/atomic.s === --- src/x86_64/atomic.s (revision 1382343) +++ src/x86_64/atomic.s (working copy) @@ -26,8 +26,17 @@ * **/ +#if defined (__MACH__) +// Mac OS X Mach-O assembler: no .type, power of two alignment +# define ALIGN_DIR .align 4 +# define TYPE_DIR(ignore,ignore2) +#else +# define ALIGN_DIR .align 16 +# define TYPE_DIR(sym,attr) .type sym, attr +#endif // __MACH__ + .text -.align 16 +ALIGN_DIR /*** * extern C int8_t __rw_atomic_xchg8 (int8_t *x, int8_t y); @@ -37,7 +46,7 @@ **/ .globl __rw_atomic_xchg8 -.type __rw_atomic_xchg8, @function +TYPE_DIR (__rw_atomic_xchg8, STT_FUNC) __rw_atomic_xchg8: /* ; int8_t (int8_t *x, int8_t y) */ movq %rdi, %rcx /* ; %rcx = x */ movb %sil, %al /* ; %al = y */ @@ -53,7 +62,7 @@ **/ .globl __rw_atomic_xchg16 -.type __rw_atomic_xchg16, @function +TYPE_DIR (__rw_atomic_xchg16, STT_FUNC) __rw_atomic_xchg16:/* ; int16_t (int16_t *x, int16_t y) */ movq %rdi, %rcx /* ; %rcx = x*/ movw %si, %ax/* ; %ax = y */ @@ -69,7 +78,7 @@ **/ .globl __rw_atomic_xchg32 -.type __rw_atomic_xchg32, @function +TYPE_DIR (__rw_atomic_xchg32, STT_FUNC) __rw_atomic_xchg32:/* ; int32_t (int32_t *x, int32_t y) */ movq %rdi, %rcx /* ; %rcx = x*/ movl %esi, %eax /* ; %eax = y*/ @@ -85,7 +94,7 @@ **/ .globl __rw_atomic_xchg64 -.type __rw_atomic_xchg64, @function +TYPE_DIR (__rw_atomic_xchg64, STT_FUNC) __rw_atomic_xchg64:/* ; int64_t (int64_t *x, int64_t y) */ movq %rdi, %rcx /* ; %rcx = x*/ movq %rsi, %rax /* ; %rax = y*/ @@ -101,7 +110,7 @@ **/ .globl __rw_atomic_add8 -.type __rw_atomic_add8, @function +TYPE_DIR (__rw_atomic_add8, STT_FUNC) __rw_atomic_add8: /* ; int8_t (int8_t *dst, int8_t inc) */ movq %rdi, %rcx /* ; %rcx = dst */ movl %esi, %eax /* ; %eax = inc */ @@ -123,7 +132,7 @@ **/ .globl __rw_atomic_add16 -.type __rw_atomic_add16, @function +TYPE_DIR (__rw_atomic_add16, STT_FUNC) __rw_atomic_add16: /* ; int16_t (int16_t *dst, int16_t inc) */ movq %rdi, %rcx /* ; %rcx = dst */ movw %si, %ax /* ; %ax = inc */ @@ -146,7 +155,7 @@ **/ .globl __rw_atomic_add32 -.type __rw_atomic_add32, @function +TYPE_DIR (__rw_atomic_add32, STT_FUNC) __rw_atomic_add32: /* ; int32_t (int32_t *dst, int32_t inc) */ movq %rdi, %rcx /* ; %rcx = dst */ movl %esi, %edx /* ; %edx = inc */ @@ -169,7 +178,7 @@ **/ .globl __rw_atomic_add64 -.type __rw_atomic_add64, @function +TYPE_DIR (__rw_atomic_add64, STT_FUNC) __rw_atomic_add64: /* ; int64_t (int64_t *dst, int64_t inc) */ movq %rdi, %rcx /* ; %rcx = dst */ movq %rsi, %rdx /* ; %edx = inc */ Index: etc/config/gcc.config === --- etc/config/gcc.config (revision 1382343) +++ etc/config/gcc.config (working copy) @@ -40,6 +40,7 @@ else ifeq ($(OSNAME),Darwin) OS_MAJOR := $(shell
Re: [PATCH] STDCXX-1067 Mac builds
On 9/9/12 7:07 PM, Wojciech Meyer wrote: Hi Liviu, I don't use Mac OS X at all but: Liviu Nicoara nikko...@hates.ms writes: The default compiler on recent Apple Macs is LLVM with Clang and gcc C++ front-ends. The compiler does not come with a C++ language support library. However, gcc Mac builds are fine with GNU stock compilers, modulo the issues for which I attach the patch below, for review. I think it will be a big win to support Clang for the community. AFAICT, there is only one problem with that, the lack of a complete language support library. Pathscale released a replacement for libsupc++/libgcc_s in the form of the pair libcxxrt/libunwind, which provide equivalent functionality on *BSD/Linux, but not Darwin. Perhaps Christopher can shed some light here. Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 09/06/12 19:54, Martin Sebor wrote: I'm not sure how easily we can do that. Almost all of locale is initialized lazily. Some of the layers might depend on the facets being initialized lazily as well. This was a deliberate design choice. One of the constraints was to avoid dynamic initialization or allocation at startup. [...] There would be a performance degradation. IMHO, it would be minor and would simplify the code considerably. After finally being able to reproduce the defect with SunPro 12.3 on x86_64 I tried to remove the lazy initialization and a (smaller) test case now passes. I am attaching the program and the numpunct patch. Out of curiosity, does it still work if you move the locale into the thread function? Like this: I created the test case because I wanted something more specific and with more predictable results. Initially, the locale object was created in the thread and it was passing. Also, does the 27.objects test pass with this patch? Will try. I don't think that we can actually use this patch, I bundled it with the test so that people can test the approach right away. I don't know if we have tests for it but writing to cerr/cout in a bad_alloc handler should succeed. I.e., this should not abort: try { struct rlimit rlim = { }; getrlimit (RLIMIT_DATA, rlim); rlim.rlim_cur = 0; setrlimit (RLIMIT_DATA, rlim); for ( ; ; ) new int; } catch (bad_alloc) { cout caught bad alloc: 123 '\n'; } I see.. Liviu
Re: A question (or two) of procedure, etc.
On 09/06/12 23:00, Martin Sebor wrote: Every project has certain branch strategy, I'm not sure about this so maybe Martin can advice. I prefer to develop on trunk and cherry pick to the other branches avoiding bulk merges (and that's in both directions). We've done most work on 4.2.x for historical reasons. I think a better strategy is to develop, as you suggest, on trunk which has the least restrictive commit policy, and merge changes out to the more restrictive branches as appropriate. If open to voting, I am for trunk first, branches later. Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 09/06/12 22:58, Stefan Teleman wrote: On Thu, Sep 6, 2012 at 7:31 PM, Liviu Nicoara nikko...@hates.ms wrote: There would be a performance degradation. IMHO, it would be minor and would simplify the code considerably. After finally being able to reproduce the defect with SunPro 12.3 on x86_64 I tried to remove the lazy initialization and a (smaller) test case now passes. I am attaching the program and the numpunct patch. With your patches, the performance is much much better: I didn't think about timing it, thanks! The result is somewhat expected without the synchronization. However, the full construction-time initialization is inappropriate if you take in consideration the delegation from the public API to the protected virtual API of the facet class, and the possibility of overrides in subclasses. Also, lazy initialization follows the principle of economy of effort, and caching increases efficiency. If possible in a robust way, I would rather have those two. :-) Liviu # INFO (S1) (10 lines): # TEXT: # COMPILER: Intel C++, __INTEL_COMPILER = 1210, __INTEL_COMPILER_BUILD_DATE = 20111011, __EDG_VERSION__ = 403 # ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15 # FILE: 22.locale.numpunct.mt.cpp # COMPILED: Sep 6 2012, 20:50:13 # COMMENT: thread safety # +---+--+--+--+ # | DIAGNOSTIC| ACTIVE | TOTAL | INACTIVE | # +---+--+--+--+ # | (S1) INFO | 11 | 11 | 0% | # | (S2) NOTE |1 |1 | 0% | # | (S8) ERROR|0 |3 | 100% | # | (S9) FATAL|0 |1 | 100% | # +---+--+--+--+ real 1035.05 user 1191.76 sys 63.49 --Stefan -- And now I see with eye serene The very pulse of the machine.
Re: A question (or two) of procedure, etc.
On 09/07/12 10:54, Martin Sebor wrote: We should remember that there are a number of Jira issues that we fixed on 4.2.x but haven't merged out to 4.3.x or trunk. The idea behind the current process (4.2.x - 4.3.x - trunk) was to be able to simply merge the branches in bulk, as opposed to an fix at a time. Unfortunately, we ran into some Subversion issues that made it a huge pain. IIRC, Travis did it at least once so he might still remember the details. That would be very helpful to know. Going forward, to avoid this mess, I would suggest we make an effort to commit fixes wherever necessary at the same time instead of delaying it until some time in the future. Got it. Liviu
dbx [was: Re: STDCXX-1056 [was: Re: STDCXX forks]]
On 09/07/12 11:58, Stefan Teleman wrote: On Fri, Sep 7, 2012 at 8:52 AM, Liviu Nicoara nikko...@hates.ms wrote: On 09/06/12 19:54, Martin Sebor wrote: Also, does the 27.objects test pass with this patch? No, it does not. It hangs at the first insertion, line 227. Unfortunately, I cannot debug it because dbx does not work properly in my installation. What are the symptoms with dbx mishbehaving? (maybe I can help). Thanks! I get this when launching the debugger: $ dbx -xexec32 t For information about new features see `help changes' To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc Reading t Reading ld-linux.so.2 dbx: fetch at 0xf400 failed -- Input/output error dbx: warning: could not put in breakpoint dbx: warning: internal handler (-80) made defunct -- could not enable event BPT dbx: warning: internal handler (-86) made defunct -- could not enable event BPT ... Then the program just runs from the get go. If I set breakpoints and rerun, it ignores them. Also, long running programs cannot be broken into with C-c: ^Cdbx: warning: Interrupt ignored but forwarded to child. signal INT in (unknown) at 0xf77cf6c4 0xf77cf6c4: movl %edx,%eax dbx: warning: 'stop' ignored -- while doing rtld handshake terminating signal 2 SIGINT dbx: warning: 'stop' ignored -- while doing rtld handshake (dbx) where dbx: program is not active (dbx) Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 09/05/12 23:51, Martin Sebor wrote: On 09/05/2012 09:03 PM, Stefan Teleman wrote: [...] Agreed. But: if the choice is between an implementation which [1] breaks ABI and [2] performs 20% worse -- even in contrived test cases -- than another implementation [2] which doesn't break ABI, and performs better than the first one, why would we even consider the first one? Breaking the ABI is not an option (unless we rev the version). But I'm not sure I understand what you think breaks the ABI. I think Stefan is referring to adding a mutex member variable to the facet in question and breaking binary compatibility. If that is the case I have confused things when I suggested exactly that, earlier. A cursory read through the __rw_facet source shows that inherits from __rw_synchronized in MT builds, therefore each facet carries its own mutex member. Liviu We don't need to add a new mutex -- we can use the __rw_facet member for the locking. Or did you mean something else? Martin --Stefan -- And now I see with eye serene The very pulse of the machine.
A question (or two) of procedure, etc.
What is the latest policy in what regards trivial fixes, e.g., the volatile qualifier for the max var in LIMITS.cpp we discussed earlier, etc.? It seems excessive to create a bug report for such issues. Also, IIUC from reading previous discussions, forward and backward binary compatible changes go in 4.2.x, followed by merges to 4.3.x and trunk. Am I getting this right? Also, besides the Linux, FreeBSD, Windows, Solaris builds hosted on Apache (Jenkins) is anybody building on HP-UX, AIX, etc.? Thanks. Liviu
Re: A question (or two) of procedure, etc.
On 09/06/12 14:37, Wojciech Meyer wrote: Liviu Nicoara nikko...@hates.ms writes: What is the latest policy in what regards trivial fixes, e.g., the volatile qualifier for the max var in LIMITS.cpp we discussed earlier, etc.? It seems excessive to create a bug report for such issues. [...] So I vote for keeping the changes as lightweight as possible, and avoid extra bureaucracy if it makes sense. This assumption is based that developers here trust their selves, and run the tests often. I'm not subscribed to the commit list here, but if I do - I usually follow people's changes and assume that developers do read commit lists. Makes sense. Thanks. So the general consensus from my experience with other project was: not sure - ask. That's my experience, also I don't have full rights to express my opinion right now about stdcxx. I sure hope we can have totally open (civilized) discussions going forward. :) Also, IIUC from reading previous discussions, forward and backward binary compatible changes go in 4.2.x, followed by merges to 4.3.x and trunk. Am I getting this right? Every project has certain branch strategy, I'm not sure about this so maybe Martin can advice. I prefer to develop on trunk and cherry pick to the other branches avoiding bulk merges (and that's in both directions). Right... I saw some discussions from a while back about active development on 4.2.x with integration to other branches as dictated by compatibility (e.g., integrating 4.2.x - 4.3.x and 4.2.x - 4.2.1), and reintegration to trunk as needed. I am not looking to change any such policy just wanna make sure I am not messing something up. Also, besides the Linux, FreeBSD, Windows, Solaris builds hosted on Apache (Jenkins) is anybody building on HP-UX, AIX, etc.? Thanks. Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On Sep 5, 2012, at 4:03 PM, Martin Sebor wrote: On 09/05/2012 01:33 PM, Liviu Nicoara wrote: On 09/05/12 15:17, Martin Sebor wrote: On 09/05/2012 12:40 PM, Liviu Nicoara wrote: On 09/05/12 14:09, Stefan Teleman wrote: On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote: [...] OK so I did a little bit of testing, after looking at the *right* __rw_guard class. :-) I changed the std::numpunct class thusly: [...] I am afraid this would be unsafe, too (if I said otherwise earlier I was wrong). [...] Thoughts? You're right, there's still a problem. We didn't get the double checked locking quite right. I wish for simplicity: eliminate the lazy initialization, and fully initialize the facet in the constructor. Then we'd need no locking on copying its data (std::string takes care of its own copying). I'm not sure how easily we can do that. Almost all of locale is initialized lazily. Some of the layers might depend on the facets being initialized lazily as well. This was a deliberate design choice. One of the constraints was to avoid dynamic initialization or allocation at startup. [...] There would be a performance degradation. IMHO, it would be minor and would simplify the code considerably. After finally being able to reproduce the defect with SunPro 12.3 on x86_64 I tried to remove the lazy initialization and a (smaller) test case now passes. I am attaching the program and the numpunct patch. Will continue to look into it. Although STDCXX does not have an implementation of the atomics library we could probably use the existing, unexposed, atomics API to implement a more robust lazy initialization. Liviu $ cat tests/localization/t.cpp; svn diff #include locale #include cstdio #include cstdlib #include unistd.h #define MAX_THREADS16 #define MAX_LOOPS1000 static bool hold = true; extern C { static void* f (void* pv) { std::numpunctchar const fac = *reinterpret_cast std::numpunctchar* (pv); while (hold) ; for (int i = 0; i != MAX_LOOPS; ++i) { std::string const grp = fac.grouping (); } return 0; } } int main (int argc, char** argv) { std::locale const loc = std::locale (argv [1]); std::numpunctchar const fac = std::use_facetstd::numpunctchar (loc); pthread_t tid [MAX_THREADS] = { 0 }; for (int i = 0; i MAX_THREADS; ++i) { if (pthread_create (tid + i, 0, f, (void*)fac)) exit (-1); } sleep (1); hold = false; for (int i = 0; i MAX_THREADS; ++i) { if (tid [i]) pthread_join (tid [i], 0); } return 0; } Index: include/loc/_numpunct.h === --- include/loc/_numpunct.h (revision 1381575) +++ include/loc/_numpunct.h (working copy) @@ -61,24 +61,40 @@ struct numpunct: _RW::__rw_facet string_type; _EXPLICIT numpunct (_RWSTD_SIZE_T __ref = 0) -: _RW::__rw_facet (__ref), _C_flags (0) { } +: _RW::__rw_facet (__ref) { +_C_grouping = do_grouping (); +_C_truename = do_truename (); +_C_falsename = do_falsename (); +_C_decimal_point = do_decimal_point (); +_C_thousands_sep = do_thousands_sep (); +} virtual ~numpunct () _RWSTD_ATTRIBUTE_NOTHROW; // 22.2.3.1.1, p1 -char_type decimal_point () const; +char_type decimal_point () const { +return _C_decimal_point; +} // 22.2.3.1.1, p2 -char_type thousands_sep () const; +char_type thousands_sep () const { +return _C_thousands_sep; +} // 22.2.3.1.1, p3 -string grouping () const; +string grouping () const { +return _C_grouping; +} // 22.2.3.1.1, p4 -string_type truename () const; +string_type truename () const { +return _C_truename; +} // 22.2.3.1.1, p4 -string_type falsename () const; +string_type falsename () const { +return _C_falsename; +} static _RW::__rw_facet_id id; @@ -112,15 +128,13 @@ protected: private: -int _C_flags; // bitmap of cached data valid flags -string _C_grouping;// cached results of virtual members +string _C_grouping; string_type _C_truename; string_type _C_falsename; char_type _C_decimal_point; char_type _C_thousands_sep; }; - #ifndef _RWSTD_NO_SPECIALIZED_FACET_ID _RWSTD_SPECIALIZED_CLASS @@ -134,95 +148,6 @@ _RW::__rw_facet_id numpunctwchar_t::id # endif // _RWSTD_NO_WCHAR_T #endif // _RWSTD_NO_SPECIALIZED_FACET_ID - -template class _CharT -inline _TYPENAME numpunct_CharT::char_type -numpunct_CharT::decimal_point () const -{ -if (!(_C_flags _RW::__rw_dp)) { - -numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this); - -// [try to] get the decimal point first (may throw) -// then set a flag to avoid future
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 09/05/12 00:51, Stefan Teleman wrote: On Tue, Sep 4, 2012 at 10:49 PM, Martin Sebor mse...@gmail.com wrote: template class _CharT inline string numpunct_CharT::grouping () const { if (!(_C_flags _RW::__rw_gr)) { numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this); _RWSTD_MT_GUARD (_C_mutex); // [try to] get the grouping first (may throw) // then set a flag to avoid future initializations __self-_C_grouping = do_grouping (); __self-_C_flags|= _RW::__rw_gr; } return _C_grouping; } That's what I wanted to do originally - use a per-object mutex. FWIW, it is pretty obvious to me _now_ that these assignments are not MT-safe, by themselves and also in the context of the test guarding them. You're right, access must be synchronized on a per-facet object basis. Since the data that needs to be protected on assignment is instance data, the mutex used in the guard must be a (yet to be added) instance mutex. That would make it binary incompatible and would go in ... 4.3.x? This works: template class _CharT inline string numpunct_CharT::grouping () const { if (!(_C_flags _RW::__rw_gr)) { numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this); _RWSTD_MT_STATIC_GUARD (_Type); It seems to me that a static guard is excessive. The only thing that needs guarding is the actual assignment to facet instance members -- a mutex instance member would suffice. And even so, this is still not thread-safe: Two different threads [ T1 and T2 ], seeking two different locales [en_US.UTF-8 and ja_JP.UTF-8], enter std::numpunct_CharT::grouping() at the same time - because they are running on two different cores. They both test for if (!(_C_flags _RW::__rw_gr)) and then -- assuming the expression above evaluates to true -- one of them wins the mutex [T1], and the other one [T2] blocks on the mutex. When T1 is done and exits the function, the grouping is set to en_US.UTF-8 and the mutex is released. Now T2 acquires the mutex, and proceeds to setting grouping to ja-JP.UTF-8. Woe on the caller running from T1 who now thinks he got en_US.UTF-8, but instead he gets ja_JP.UTF-8, which was duly set so by T2, but T1 had no clue about it (remember, the std::string grouping _charT buffer is shared by the caller from T1 and the caller from T2). I am pretty sure that in your example, they would operate on two different instances of the facet class. The static guard would synchronize their running alright but through two different facet instances, copying data from two different places of the locale database. In this respect a static guard is excessive. I would defer it to Martin to validate a course of action but to me it looks like the only problem is the concurrent assignment to facet instance member variables inside the facet member functions. Which could be easily and nicely solved with a plain guard on a mutex ordinary member variable. Thanks! Liviu
Re: STDCXX-1056 [was: Re: STDCXX forks]
On 09/05/12 14:09, Stefan Teleman wrote: On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote: [...] OK so I did a little bit of testing, after looking at the *right* __rw_guard class. :-) I changed the std::numpunct class thusly: [...] And then: template class _CharT inline string numpunct_CharT::grouping () const { if (!(_C_flags _RW::__rw_gr)) { _RWSTD_MT_GUARD (_C_object_mutex); // double-test here to avoid re-writing an already written string if (!(_C_flags _RW::__rw_gr)) { numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this); // [try to] get the grouping first (may throw) // then set a flag to avoid future initializations __self-_C_grouping = do_grouping (); __self-_C_flags|= _RW::__rw_gr; } } return _C_grouping; } I am afraid this would be unsafe, too (if I said otherwise earlier I was wrong). The compiler might re-arrange the protected assignments, such that another thread sees a partially updated object, where the flags are updated and the string not. I don't think we're going to get away with this here without either a simpler and more inefficient top-level locking, or doing away completely with the lazy initialization. Thoughts? Liviu