Re: Status

2013-04-15 Thread Liviu Nicoara

On 04/15/13 09:35, Jim Jagielski wrote:

Just a quick Email... the list has been pretty quiet
lately and just pinging to see who's still watching
and reading :)



Still here.


Re: [PATCH] STDCXX-1073

2012-10-28 Thread Liviu Nicoara

On 10/25/12 11:31, Martin Sebor wrote:

There are comments suggesting that calling do_transform()
on the whole string may be suboptimal. Intuitively it makes
sense that calling wcscoll() (in a loop, on the NUL-terminated
substrings, if necessary) should be faster than simply calling
do_transform() followed by wstring::compare(), but it would
make sense to confirm the hypothesis before implementing the
optimization.


Alright, same files, two more patches.

2012-10-28  Liviu Nicoara  lnico...@apache.org

Fixes to collate facet and test enhancements:

* src/collate.cpp (__rw_strnxfrm): corrected processing of
  embedded NULs. (__rw_wcsnxfrm) same (duplicated code).
  (collate_bynamewchar_t::do_compare): fixed string comparison
  return values, re-implemented the wcscoll-based comparison.
* tests/localization/22.locale.collate.cpp: implemented a simpler
  collation test for strings with embedded NULs.

All existing and new collate tests pass, but I think some more test 
enhancements are required.

Liviu
Index: tests/localization/22.locale.collate.cpp
===
--- tests/localization/22.locale.collate.cpp(revision 1402988)
+++ tests/localization/22.locale.collate.cpp(working copy)
@@ -116,7 +116,7 @@ const char* widen (char* dst, const char
 return dst;
 }
 
-#ifndef _RWSTD_NO_WCHAR_T
+#if !defined (_RWSTD_NO_WCHAR_T)
 
 int c_strcoll (const wchar_t* s1, const wchar_t* s2)
 {
@@ -1029,65 +1029,119 @@ check_hash_eff (const char* charTname)
 
 template class charT
 void
-check_NUL_locale (const char* charTname, const char* locname)
+check_NUL_collate (const char* charTname, const char* locname,
+   const charT* s1, size_t s1_len,
+   const charT* s2, size_t s2_len)
 {
 std::locale loc (locname);
 
-charT s [STR_SIZE];
-gen_str (s, STR_SIZE);
+typedef typename std::collatecharT Collate;
+typedef typename Collate::string_type String;
 
-charT buf [2][STR_SIZE];
+const Collate col = std::use_facetCollate (loc);
 
-std::memcpy (buf [0], s, sizeof s);
-std::memcpy (buf [1], s, sizeof s);
+const String x1 = col.transform (s1, s1 + s1_len);
+const String x2 = col.transform (s2, s2 + s2_len);
 
-//
-// Verify that first buffer compares more:
-// |0| = buf [0]
-// |0| = buf [1]
-// 
-buf [0][4] = charT ();
-buf [1][3] = charT ();
+const int colcmp = col.compare (s1, s1 + s1_len, s2, s2 + s2_len);
 
-typedef std::collatecharT Collate;
+int lexcmp = x1.compare (x2);
+lexcmp = lexcmp  -1 ? -1 : 1  lexcmp ? 1 : lexcmp;
+
+rw_assert (colcmp == lexcmp, __FILE__, __LINE__,
+   collate%s::compare (%{*.*Ac}, %{*.*Ac}) = %d, 
+   lexicographical comparison of transformed strings = %d, 
+   mismatch in locale (\%s\), charTname,
+   sizeof (charT), s1_len, s1, 
+   sizeof (charT), s2_len, s2, 
+   colcmp, lexcmp, locname);
+
+const bool eq = 
+std::string (s1, s1 + s1_len) == 
+std::string (s2, s2 + s2_len);
+
+rw_assert (bool (colcmp) != eq, __FILE__, __LINE__,
+   collate%s::compare (%{*.*Ac}, %{*.*Ac}) = %d, 
+   lexicographical compare = %s, mismatch in locale (\%s\),
+   charTname,
+   sizeof (charT), s1_len, s1, 
+   sizeof (charT), s2_len, s2, colcmp,
+   (eq ? true : false), locname);
+}
 
-const Collate col = std::use_facetCollate (loc);
+static void
+check_NUL_collate (const char* charTname, const char* locname, char)
+{
+#define T(s, t) \
+check_NUL_collate (charTname, locname,  \
+   s, sizeof s / sizeof *s - 1, \
+   t, sizeof t / sizeof *t - 1)
+
+T (, );
+T (, \0);
+T (, \0\0);
+T (\0, );
+T (\0, \0);
+T (\0, \0\0);
+T (a, \0);
+T (a, \0a);
+T (a, a\0);
+T (a, a\0\0);
+T (a\0, a);
+T (a\0, a\0);
+T (a\0, a\0\0);
+T (\0a, );
+T (\0a, \0);
+T (\0a, \0a);
+T (\0a, \0a\0);
+T (a\0\0b, );
+T (a\0\0b, a);
+T (a\0\0b, ab);
+T (a\0\0b, a\0);
+T (a\0\0b, a\0\0);
+T (a\0\0b, a\0b);
+T (a\0\0b, a\0\0b);
+}
 
-int cmp = col.compare (
-buf [0], buf [0] + sizeof buf [0] / sizeof *buf [0], 
-buf [1], buf [1] + sizeof buf [1] / sizeof *buf [1]);
-
-rw_assert (cmp  0, __FILE__, __LINE__,
-   collate%s::compare (%{*.*Ac}, %{*.*Ac}) 
- 0, failed in locale (\%s\), charTname,
-   sizeof (charT), sizeof buf [0] / sizeof *buf [0], buf [0],
-   sizeof (charT), sizeof buf [1] / sizeof *buf [1], buf [1],
-   locname);
-
-std::memcpy (buf [0], s, sizeof s);
-std::memcpy (buf [1], s, sizeof s);
-
-//
-// Verify that first compare less

Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-27 Thread Liviu Nicoara

On 10/26/12 18:50, Martin Sebor wrote:

On 10/26/2012 06:50 AM, Liviu Nicoara wrote:

[...]
tl;dr: removing the facet data cache is a priority. All else can be put
on the back-burner.
[...]
// facet data accessor
...
if (0 == _C_impsize) {   // 1
mutex_lock ();
if (_C_impsize)
return _C_data;
_C_data = get_facet_data (); // 2
??   // 3
_C_impsize = 1;  // 4
mutex_unlock ();
}
??   // 5
return _C_data;  // 6
[...]
Various compilers provide these features in various forms, but at the
moment we don't have a unified STDCXX API to implement this.



[...]
I suggested moving the body of the outer if from the header
into a .cpp file in the library where we could implement
ugly, bloated locking without the risk of breaking things
if we removed/replaced it in the future. That's when we ran
into questions about how exactly to do this cleanly, etc.
It didn't seem to be doable very cleanly but I still think
it's a viable approach.


Just making sure we are talking about the same thing. My argument (and the code 
oversimplification above) was about the initialization of the std::facet data, 
not the cached numpunct data. I think we should get rid of the numpunct cache 
right away because it makes the facet unusable in MT builds.

As for the DCII I was talking about in the previous post, I would tackle it but 
I only have access to x86(_64). Does the foundation have shared dev machines?

Liviu


Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-26 Thread Liviu Nicoara

On 10/03/12 11:10, Martin Sebor wrote:

[...]
I was just thinking of a few simple loops along the lines of:

   void* thread_func (void*) {
   for (int i = 0; i  N; ++)
   test 1: do some simple stuff inline
   test 2: call a virtual function to do the same stuff
   test 3: lock and unlock a mutex and do the same stuff
   }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what simple stuff is (eventually, even when
it's getting numpunct::grouping() data).


tl;dr: removing the facet data cache is a priority. All else can be put 
on the back-burner.


Conflicting test results aside, there still is the case of the incorrect 
handling of the cached data in the facet. I don't think there is a 
disagreement on that. Considering that the std::string is moving in the 
direction of dropping the handle-body implementation, simply getting rid 
of the cache is a step in the same direction.


I think that we should preserve the lock-free reading of the facet data, 
as a benign race, but making it benign is perhaps more complicated than 
previously suggested.


As a reminder, the core of the facet access and initialization code 
essentially looks like this (pseudocode-ish):



// facet data accessor
...
if (0 == _C_impsize) {  // 1
mutex_lock ();
if (_C_impsize)
return _C_data;
_C_data= get_facet_data (); // 2
??  // 3
_C_impsize = 1; // 4
mutex_unlock ();
}
??  // 5
return _C_data; // 6
...


with question marks for missing, necessary fixes. The compiler needs to 
be prevented from re-ordering both 2-4 and 1-6. Just for the sake of 
argument I can imagine an optimization that reorders the reads in 1-6:


register x = _C_data;
if (_C_impsize)
return x;

and if the loads are executed in this order, the caller will see a stale 
_C_data.


First, the 2-4 writes need to be executed in the program order. This 
needs both a compiler barrier and a store-store memory barrier that will 
keep the writes ordered.


Then, the reads in 1-6 need to be ordered such that _C_data is read 
after _C_impsize, via a compiler barrier and a load-load memory barrier 
that will preserve the program order of the loads.


Various compilers provide these features in various forms, but at the 
moment we don't have a unified STDCXX API to implement this.


Of course, I might be wrong. Input is appreciated.

Thanks,
Liviu



Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-21 Thread Liviu Nicoara

On 10/03/12 11:10, Martin Sebor wrote:

[...]
I was just thinking of a few simple loops along the lines of:



tl;dr: I consider the results of the multi-threaded performance tests (12S, 
Intel/AMD multicores) as coming from heavy contention in copying of 
reference-counted std::string objects.

The long version: I went back to the original test case (attached, t.cpp). As a 
reminder, the program is a simplification of the library numpunct facet 
creation process, which can be compiled either with data caching enabled or 
disabled.

I ran the perf tool (https://perf.wiki.kernel.org/index.php/Main_Page) on the 
program, and I noticed marked differences in the number of instructions and 
cycles (an order of magnitude) and also in the number of CPU-migrations and 
context-switches (way larger for the cached version):

 Performance counter stats for './t':

  121168.026301  task-clock-msecs #  3.778 CPUs
6133377  context-switches #  0.051 M/sec
  35457  CPU-migrations   #  0.000 M/sec
298  page-faults  #  0.000 M/sec
   338006437277  cycles   #   2789.568 M/sec
   136554855096  instructions #  0.404 IPC
31298346444  branches #258.305 M/sec
 2305761092  branch-misses#  7.367 %
 1592157186  cache-references # 13.140 M/sec
1729755  cache-misses #  0.014 M/sec

   32.070381363  seconds time elapsed

vs.

 Performance counter stats for './t':

   16194.235282  task-clock-msecs #  3.883 CPUs
   7491  context-switches #  0.000 M/sec
 47  CPU-migrations   #  0.000 M/sec
311  page-faults  #  0.000 M/sec
45210281987  cycles   #   2791.752 M/sec
76784904688  instructions #  1.698 IPC
18891711522  branches #   1166.570 M/sec
 3162251552  branch-misses# 16.739 %
3332839  cache-references #  0.206 M/sec
 119424  cache-misses #  0.007 M/sec

4.170904279  seconds time elapsed


Then, I have run more detailed reports on the runs. I noticed that the report 
for the cached version contains what appears to be thread synchronization 
overhead, whereas the non-cached version doesn't:

# Events: 134K cycles
#
# Overhead  Command   Shared Object   Symbol
#   ...  ..  ...
#
50.69%t  [kernel.kallsyms]   [k] _raw_spin_lock
 5.46%t  libpthread-2.13.so  [.] pthread_mutex_lock
 4.16%t  libpthread-2.13.so  [.] __pthread_mutex_unlock
 3.65%t  libpthread-2.13.so  [.] __lll_lock_wait
 3.35%t  [kernel.kallsyms]   [k] copy_user_generic_string
 3.09%t  40395b  [.]   40395b
 2.91%t  [kernel.kallsyms]   [k] hash_futex
 2.87%t  [kernel.kallsyms]   [k] futex_wake
 2.67%t  t   [.] f
 2.37%t  [kernel.kallsyms]   [k] futex_wait_setup
 1.61%t  [kernel.kallsyms]   [k] system_call
 1.00%t  [kernel.kallsyms]   [k] system_call_after_swapgs
 0.98%t  [kernel.kallsyms]   [k] __switch_to
 0.96%t  libpthread-2.13.so  [.] __lll_unlock_wake
 0.95%t  [kernel.kallsyms]   [k] get_futex_key
 0.92%t  libc-2.13.so[.] __strlen_sse42
 [...]


vs.


# Events: 15K cycles
#
# Overhead  Command   Shared Object 

 Symbol
#   ...  ..  
..
#
49.55%   :295717f1ed3830a86  [.] 7f1ed3830a86
 8.67%t  libc-2.13.so[.] __malloc
 8.03%t  libc-2.13.so[.] cfree
 4.88%t  t   [.] std::basic_stringchar, std::char_traitschar, 
std::allocatorchar ::basic_string(char const*, std::allocatorchar const)
 4.67%t  libc-2.13.so[.] __strlen_sse42
 4.14%t  libpthread-2.13.so  [.] __pthread_mutex_unlock
 3.62%t  libpthread-2.13.so  [.] pthread_mutex_lock
 2.64%t  t   [.] f
 2.40%t  libc-2.13.so[.] _int_malloc
 2.10%t  libc-2.13.so[.] _int_free
 1.53%t  libpthread-2.13.so  [.] __pthread_mutex_init_internal
 1.33%t  t   [.] operator new(unsigned long)
 1.08%t  t   [.] X::numpunct::do_grouping() const
 1.04%t  libc-2.13.so[.] __memcpy_sse2
 

Re: [PATCH] STDCXX-1073

2012-10-21 Thread Liviu Nicoara

On 10/21/12 19:08, Martin Sebor wrote:

There's no requirement that embedded NULs must be preserved
(that's just how it happens to be implemented). I find it best
to avoid relying on the knowledge of implementation details
when exercising the library so that tests don't start failing
after a conforming optimization or some such tweak is added
to the code.


That's right. I'll improve it.

Thanks for the review.

Liviu


std::string reference counting performance

2012-10-21 Thread Liviu Nicoara

Hi all,

I ran the attached test case, s.cpp, and timed it:

* Deep copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

16  0m39.292s   2m34.381s   0m0.035s
8   0m20.573s   1m18.131s   0m0.025s
6   0m15.037s   0m58.269s   0m0.009s
4   0m9.973s0m38.657s   0m0.009s
2   0m8.750s0m17.473s   0m0.016s
1   0m8.499s0m8.495s0m0.002s

* Shallow copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

16  5m29.311s   4m2.107s17m39.634s
8   2m43.635s   2m1.823s8m50.454s
6   2m10.601s   2m2.316s6m18.462s
4   1m20.766s   1m35.383s   3m29.467s
2   0m25.146s   0m27.775s   0m22.408s
1   0m3.557s0m3.555s0m0.000s

The results show a counter-intuitive better performance for deep-copied 
strings. Again, there is the reversal of performance for one-threaded 
programs where there is no contention.


Liviu
#include iostream
#include locale

#include cstdio
#include cstdlib
#include cstring

#include pthread.h
#include unistd.h

#define MAX_THREADS 128

static long nloops = 1000, nthreads = 16;
static bool volatile pwait = true;



extern C {

static void* 
f (void* pv)
{
std::string const s = *reinterpret_cast std::string*  (pv);

volatile unsigned long n = 0;

while (pwait) ;

for (int i = 0; i  nloops; ++i) {
#if !defined (NO_DEEP_COPY)
const std::string tmp (s.c_str ());
#else
const std::string tmp = s;
#endif // NO_DEEP_COPY

n += strlen (tmp.c_str ());
}

return (void*)n;
}

} // extern C

int
main (int argc, char** argv)
{
switch (argc) {
case 3:
nloops = atol (argv [2]);
case 2:
nthreads = atol (argv [1]);
break;
}

pthread_t tid [MAX_THREADS] = { 0 };

if (nthreads  MAX_THREADS)
nthreads = MAX_THREADS;

printf (%ld, %ld\n, nthreads, nloops);

pthread_setconcurrency (nthreads);

std::string s (Hello, world.);

for (int i = 0; i  nthreads; ++i) {
if (pthread_create (tid + i, 0, f, s))
exit (-1);
}

usleep (50);
pwait = false;

for (int i = 0; i  nthreads; ++i) {
if (tid [i])
pthread_join (tid [i], 0);
}

return 0;
}


* Deep-copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

16  0m39.292s   2m34.381s   0m0.035s
8   0m20.573s   1m18.131s   0m0.025s
6   0m15.037s   0m58.269s   0m0.009s
4   0m9.973s0m38.657s   0m0.009s
2   0m8.750s0m17.473s   0m0.016s
1   0m8.499s0m8.495s0m0.002s

* Ref-counted

16  5m29.311s   4m2.107s17m39.634s
8   2m43.635s   2m1.823s8m50.454s
6   2m10.601s   2m2.316s6m18.462s
4   1m20.766s   1m35.383s   3m29.467s
2   0m25.146s   0m27.775s   0m22.408s
1   0m3.557s0m3.555s0m0.000s


Re: std::string reference counting performance

2012-10-21 Thread Liviu Nicoara

On 10/21/12 21:11, Liviu Nicoara wrote:

Hi all,

I ran the attached test case, s.cpp, and timed it:


Gosh darn it, attached the wrong files. Here are the right ones.

Liviu


* Deep-copy

$ for f in 16 8 6 4 2 1; do time ./s $f 1; done

16  0m39.292s   2m34.381s   0m0.035s
8   0m20.573s   1m18.131s   0m0.025s
6   0m15.037s   0m58.269s   0m0.009s
4   0m9.973s0m38.657s   0m0.009s
2   0m8.750s0m17.473s   0m0.016s
1   0m8.499s0m8.495s0m0.002s

* Ref-counted

16  5m29.311s   4m2.107s17m39.634s
8   2m43.635s   2m1.823s8m50.454s
6   2m10.601s   2m2.316s6m18.462s
4   1m20.766s   1m35.383s   3m29.467s
2   0m25.146s   0m27.775s   0m22.408s
1   0m3.557s0m3.555s0m0.000s
#include iostream
#include locale

#include cstdio
#include cstdlib
#include cstring

#include pthread.h
#include unistd.h

#define MAX_THREADS 128

static long nloops = 1000, nthreads = 16;
static bool volatile pwait = true;



extern C {

static void* 
f (void* pv)
{
std::string const s = *reinterpret_cast std::string*  (pv);

volatile unsigned long n = 0;

while (pwait) ;

for (int i = 0; i  nloops; ++i) {
#if !defined (NO_DEEP_COPY)
const std::string tmp (s.c_str ());
#else
const std::string tmp = s;
#endif // NO_DEEP_COPY

n += strlen (tmp.c_str ());
}

return (void*)n;
}

} // extern C

int
main (int argc, char** argv)
{
switch (argc) {
case 3:
nloops = atol (argv [2]);
case 2:
nthreads = atol (argv [1]);
break;
}

pthread_t tid [MAX_THREADS] = { 0 };

if (nthreads  MAX_THREADS)
nthreads = MAX_THREADS;

printf (%ld, %ld\n, nthreads, nloops);

pthread_setconcurrency (nthreads);

std::string s (Hello, world.);

for (int i = 0; i  nthreads; ++i) {
if (pthread_create (tid + i, 0, f, s))
exit (-1);
}

usleep (50);
pwait = false;

for (int i = 0; i  nthreads; ++i) {
if (tid [i])
pthread_join (tid [i], 0);
}

return 0;
}



Re: std::string reference counting performance

2012-10-21 Thread Liviu Nicoara

On 10/21/12 21:44, Liviu Nicoara wrote:

On 10/21/12 21:11, Liviu Nicoara wrote:

Hi all,

I ran the attached test case, s.cpp, and timed it:


Gosh darn it, attached the wrong files. Here are the right ones.


Made the same mistake twice. My apologies for the clutter. Here are the 
right ones.


Liviu

* Deep-copy

$ for f in 16 8 4 2 1; do time ./s $f 5000; done
16, 50000m9.438s2m15.615s   0m10.799s
 8, 50000m6.973s0m55.579s   0m0.006s
 4, 50000m6.942s0m27.505s   0m0.001s
 2, 50000m6.888s0m13.753s   0m0.002s
 1, 50000m6.800s0m6.799s0m0.001s

* Ref-counted

$ for f in 16 8 4 2 1; do time ./s $f 5000; done
16, 50002m19.098s   2m36.278s   30m38.488s
 8, 50001m47.757s   2m7.278s10m18.418s
 4, 50004m27.425s   5m56.620s   10m13.892s
 2, 50000m17.667s   0m22.560s   0m10.695s
 1, 50000m2.923s0m2.921s0m0.001s

#include iostream
#include locale

#include cstdio
#include cstdlib
#include cstring

#include pthread.h
#include unistd.h

#define MAX_THREADS 128

static long nloops = 1000, nthreads = 16;
static bool volatile pwait = true;



extern C {

static void* 
f (void* pv)
{
std::string const s = *reinterpret_cast std::string*  (pv);

volatile unsigned long n = 0;

while (pwait) ;

for (int i = 0; i  nloops; ++i) {
#if !defined (NO_DEEP_COPY)
const std::string tmp (s.c_str (), s.size ());
#else
const std::string tmp = s;
#endif // NO_DEEP_COPY

n += strlen (tmp.c_str ());
}

return (void*)n;
}

} // extern C

int
main (int argc, char** argv)
{
switch (argc) {
case 3:
nloops = atol (argv [2]);
case 2:
nthreads = atol (argv [1]);
break;
}

pthread_t tid [MAX_THREADS] = { 0 };

if (nthreads  MAX_THREADS)
nthreads = MAX_THREADS;

printf (%ld, %ld\n, nthreads, nloops);

pthread_setconcurrency (nthreads);

std::string s (Hello, world.);

for (int i = 0; i  nthreads; ++i) {
if (pthread_create (tid + i, 0, f, s))
exit (-1);
}

usleep (50);
pwait = false;

for (int i = 0; i  nthreads; ++i) {
if (tid [i])
pthread_join (tid [i], 0);
}

return 0;
}




Re: Performance tests

2012-10-18 Thread Liviu Nicoara

On 10/18/12 10:49, Martin Sebor wrote:

On 10/16/2012 10:45 AM, Liviu Nicoara wrote:

Are there any performance tests (and measurements) of STDCXX features
against similar features in libc?


I don't think so. A long time ago I benchmarked stdcxx iostreams
and stdio on a variety of operating systems but I'm not sure where
those results are. They were interesting, though. On most systems,
stdcxx iostreams was either on par or even faster than stdio. The
one OS I remember not being able to beat was GLIBC.



That is what I remember seeing a long time ago. I looked in my old mailboxes 
but I could not find it.

Liviu


Re: [PATCH] STDCXX-1073

2012-10-18 Thread Liviu Nicoara

On 10/18/12 11:02, Martin Sebor wrote:

On 10/16/2012 10:38 AM, Liviu Nicoara wrote:

I have enhanced the collation test, 22.locale.collate.cpp with a bunch
of cases that deal with embedded strings, inside the input strings as
well as at the edges. More defects became apparent, and they have been
fixed.

I have re-worked the src/collate.cpp patch and I am attaching it here.
All collation tests (old and new) pass. If there are no objections, I
will check it in later in the week.


There are tabs in the patch (we need spaces). They also make
the patch harder to review than it otherwise would be. I would
also suggest keeping the MSVC  8.0 block and cleaning things
up separately. That will also make the patch easier to review.
(I'm so swamped that this matters to me.)


I looked at both (latest) patches and they do not have tabs in them. They were 
sent in my last post on the 16th, the one you replied to. The diff without -w 
is quite a bit verbose because I moved a whole block in both collate.cpp 
functions, inside an if, so there are whitespace differences. svn diff -w gave 
me a quite decent view of the changes without having to looks at the actual 
patched source file.

I am re-attaching the patches, this time created without -w so they would be a 
bit verbose. No tabs this time either.



I haven't looked at the test patch yet. If you could clean up
the tab issue and reduce the amount of unnecessary whitespace
changes I promise to review it this weekend.


The test enhancements are pretty plain, just a bunch of added stuff.
 
Thanks,

Liviu

Index: tests/localization/22.locale.collate.cpp
===
--- tests/localization/22.locale.collate.cpp(revision 1399886)
+++ tests/localization/22.locale.collate.cpp(working copy)
@@ -1029,7 +1029,105 @@ check_hash_eff (const char* charTname)
 
 template class charT
 void
-check_NUL_locale (const char* charTname, const char* locname)
+check_NUL_transform (const char* charTname, const char* locname,
+ const charT* src, size_t src_len,
+ const char*  pat, size_t pat_len)
+{
+std::locale loc (locname);
+
+typedef typename std::collatecharT Collate;
+typedef typename Collate::string_type String;
+
+const Collate col = std::use_facetCollate (loc);
+
+String s = col.transform (src, src + src_len);
+
+for (size_t i = 0, j = 0; i  pat_len; ++i) { 
+
+if (j = s.size ()) {
+rw_assert (false, __FILE__, __LINE__,
+   collate%s::transform (%{*.*Ac}) -  %{*.*Ac}
+   , incomplete transformation (locale \%s\), 
+   charTname, sizeof (charT), src_len, src, 
+   sizeof (charT), s.size (), s.c_str (), locname);
+break;
+}
+
+switch (pat [i]) {
+case 0:
+rw_assert (j  s.size ()  0 == s [j], __FILE__, __LINE__,
+   collate%s::transform (\%{*.*Ac}\) -  \%{*.*Ac}\
+   , expected \\0, got %d in position %lu (locale 
\%s\), 
+   charTname, sizeof (charT), src_len, src, sizeof 
(charT), 
+   s.size (), s.c_str (), s [j], j, locname);
+++j;
+break;
+
+case '*':
+rw_assert (j  s.size ()  s [j], __FILE__, __LINE__,
+   collate%s::transform (\%{*.*Ac}\) -  \%{*.*Ac}\
+   , got NUL in position %lu (locale \%s\), charTname,
+   sizeof (charT), src_len, src, sizeof (charT), s.size 
(), 
+   s.c_str (), j, locname);
+for (; j  s.size ()  s [j]; ++j) ;
+break;
+
+default:
+break;
+}
+}
+}
+
+static void
+check_NUL_transform_narrow (const char* charTname, const char* locname)
+{
+#define T(src, pat) \
+check_NUL_transform (charTname, locname,\
+ src, sizeof src / sizeof *src - 1, \
+ pat, sizeof pat / sizeof *pat - 1)
+
+T (\0,\0   );
+T (\0\0,  \0\0 );
+T (\0\0\0,\0\0\0   );
+T (a\0,   *\0  );
+T (\0a\0, \0*\0);
+T (\0a\0, \0*\0);
+T (a\0\0, *\0\0);
+T (a\0\0\0,   *\0\0\0  );
+T (a\0b,  *\0* );
+T (a\0b\0,*\0*\0   );
+T (a\0b\0\0,  *\0*\0\0 );
+}
+
+static void
+check_NUL_transform_wide (const char* charTname, const char* locname)
+{
+T (L\0,   \0   );
+T (L\0\0, \0\0 );
+T (L\0\0\0,   \0\0\0   );
+T (La\0,  *\0  );
+T (L\0a\0,\0*\0);
+T (L\0a\0,\0*\0);
+T (La\0\0,*\0\0);
+T (La\0\0\0,  *\0\0\0  );
+T (La\0b, *\0* );
+T (La\0b\0,   *\0*\0   );
+T (La\0b\0\0, *\0*\0\0 );
+
+#undef T
+}
+
+template class charT
+void
+check_NUL_transform (const char

Re: [PATCH] STDCXX-1073

2012-10-16 Thread Liviu Nicoara
I have enhanced the collation test, 22.locale.collate.cpp with a bunch 
of cases that deal with embedded strings, inside the input strings as 
well as at the edges. More defects became apparent, and they have been 
fixed.


I have re-worked the src/collate.cpp patch and I am attaching it here. 
All collation tests (old and new) pass. If there are no objections, I 
will check it in later in the week.


Thanks,
Liviu

On 10/13/12 11:16, Liviu Nicoara wrote:

The initial patch does not pass the following test case. Have re-worked
the patch and attached it to the incident, and I am also attaching it
here. It passes all collate tests.




Index: tests/localization/22.locale.collate.cpp
===
--- tests/localization/22.locale.collate.cpp(revision 1397847)
+++ tests/localization/22.locale.collate.cpp(working copy)
@@ -1029,7 +1029,105 @@ check_hash_eff (const char* charTname)
 
 template class charT
 void
-check_NUL_locale (const char* charTname, const char* locname)
+check_NUL_transform (const char* charTname, const char* locname,
+ const charT* src, size_t src_len,
+ const char*  pat, size_t pat_len)
+{
+std::locale loc (locname);
+
+typedef typename std::collatecharT Collate;
+typedef typename Collate::string_type String;
+
+const Collate col = std::use_facetCollate (loc);
+
+String s = col.transform (src, src + src_len);
+
+for (size_t i = 0, j = 0; i  pat_len; ++i) { 
+
+if (j = s.size ()) {
+rw_assert (false, __FILE__, __LINE__,
+   collate%s::transform (%{*.*Ac}) -  %{*.*Ac}
+   , incomplete transformation (locale \%s\), 
+   charTname, sizeof (charT), src_len, src, 
+   sizeof (charT), s.size (), s.c_str (), locname);
+break;
+}
+
+switch (pat [i]) {
+case 0:
+rw_assert (j  s.size ()  0 == s [j], __FILE__, __LINE__,
+   collate%s::transform (\%{*.*Ac}\) -  \%{*.*Ac}\
+   , expected \\0, got %d in position %lu (locale 
\%s\), 
+   charTname, sizeof (charT), src_len, src, sizeof 
(charT), 
+   s.size (), s.c_str (), s [j], j, locname);
+++j;
+break;
+
+case '*':
+rw_assert (j  s.size ()  s [j], __FILE__, __LINE__,
+   collate%s::transform (\%{*.*Ac}\) -  \%{*.*Ac}\
+   , got NUL in position %lu (locale \%s\), charTname,
+   sizeof (charT), src_len, src, sizeof (charT), s.size 
(), 
+   s.c_str (), j, locname);
+for (; j  s.size ()  s [j]; ++j) ;
+break;
+
+default:
+break;
+}
+}
+}
+
+static void
+check_NUL_transform_narrow (const char* charTname, const char* locname)
+{
+#define T(src, pat) \
+check_NUL_transform (charTname, locname,\
+ src, sizeof src / sizeof *src - 1, \
+ pat, sizeof pat / sizeof *pat - 1)
+
+T (\0,\0   );
+T (\0\0,  \0\0 );
+T (\0\0\0,\0\0\0   );
+T (a\0,   *\0  );
+T (\0a\0, \0*\0);
+T (\0a\0, \0*\0);
+T (a\0\0, *\0\0);
+T (a\0\0\0,   *\0\0\0  );
+T (a\0b,  *\0* );
+T (a\0b\0,*\0*\0   );
+T (a\0b\0\0,  *\0*\0\0 );
+}
+
+static void
+check_NUL_transform_wide (const char* charTname, const char* locname)
+{
+T (L\0,   \0   );
+T (L\0\0, \0\0 );
+T (L\0\0\0,   \0\0\0   );
+T (La\0,  *\0  );
+T (L\0a\0,\0*\0);
+T (L\0a\0,\0*\0);
+T (La\0\0,*\0\0);
+T (La\0\0\0,  *\0\0\0  );
+T (La\0b, *\0* );
+T (La\0b\0,   *\0*\0   );
+T (La\0b\0\0, *\0*\0\0 );
+
+#undef T
+}
+
+template class charT
+void
+check_NUL_transform (const char* charTname, const char* locname)
+{
+check_NUL_transform_narrow (charTname, locname);
+check_NUL_transform_wide   (charTname, locname);
+}
+
+template class charT
+void
+check_NUL_collate (const char* charTname, const char* locname)
 {
 std::locale loc (locname);
 
@@ -1090,6 +1188,14 @@ check_NUL_locale (const char* charTname,
 
 template class charT
 void
+check_NUL (const char* charTname, const char* locname)
+{
+check_NUL_transform charT  (charTname, locname);
+check_NUL_collate charT(charTname, locname);
+}
+
+template class charT
+void
 check_NUL (const char* charTname)
 {
 // Verify that the collate facet correctly handles character
@@ -1103,7 +1209,7 @@ check_NUL (const char* charTname)
 for (const char* locname = rw_locales (LC_COLLATE); 
  *locname; locname += std::strlen (locname) + 1, ++i) {
 try {
-check_NUL_localecharT (charTname, locname

Performance tests

2012-10-16 Thread Liviu Nicoara
Are there any performance tests (and measurements) of STDCXX features 
against similar features in libc?


Thanks,
Liviu


Re: [PATCH] STDCXX-970

2012-10-13 Thread Liviu Nicoara

On 10/10/12 07:40, Liviu Nicoara wrote:

The following patch cleans up most of the failures seen in the collation
test. It does not fix the transform failures and libstd tests.

2012-10-10  Liviu Nicoara  lnico...@apache.org

Various fixes:

 * tests/localization/22.locale.collate.cpp: removed unused
   macros, corrected assertions and their formatting. (c_xfrm):
   corrected use of safety buffers. (make_test_locale): used
   rw_create_locale, used the whole portable character
   set. (check_libc): split code, restored previous locales after
   test. (check_NUL): hard-coded positions of NUL characters.


Committed revision 1397847.


Re: [PATCH] STDCXX-1073

2012-10-13 Thread Liviu Nicoara
The initial patch does not pass the following test case. Have re-worked 
the patch and attached it to the incident, and I am also attaching it 
here. It passes all collate tests.


Here is the second test case:

$ cat ../../tests/localization/t.cpp; nice make t.cpp  ./t af_ZA.utf8; 
echo $?

#include iostream
#include locale
#include string

int
main (int argc, char** argv)
{
std::locale loc (argv [1]);

const std::collate char  fac =
std::use_facet std::collate char   (loc);

char const buf [] = a\0\0b;
std::string s = fac.transform (buf, buf + sizeof buf - 1);

size_t i = 0;
for (; i  s.size ()  0 == s [i]; ++i) ;

return !(i == 2);
}
1

On 10/10/12 08:25, Liviu Nicoara wrote:

2012-10-10  Liviu Nicoara  lnico...@apache.org

 * src/collate.cpp (__rw_strnxfrm): preserved embedded NULs



Index: src/collate.cpp
===
--- src/collate.cpp (revision 1397825)
+++ src/collate.cpp (working copy)
@@ -480,113 +480,103 @@
 {
 _STD::string res;
 
-char buf [256];
-char *pbuf = buf;
-
+char buf [256], *pbuf = buf, *psrc = buf;
 size_t bufsize = sizeof buf;
-char *psrc = buf;
 
 while (nchars) {
 
-// using a C-style cast instead of static_cast to avoid
-// a gcc 2.95.2 bug causing an error on some platforms:
-//   static_cast from `void *' to `const char *'
-const char* const last = (const char*)memchr (src, '\0', nchars);
-
-if (0 == last) {
-
-// no NUL found in the initial portion of the source string
-// that fits into the local temporary buffer; copy as many
-// characters as fit into the buffer
+if (src [0]) {
 
-if (bufsize = nchars) {
-if (pbuf != buf)
-delete[] pbuf;
-pbuf = new char [nchars + 1];
+// using a C-style cast instead of static_cast to avoid
+// a gcc 2.95.2 bug causing an error on some platforms:
+//   static_cast from `void *' to `const char *'
+const char* const last = (const char*)memchr (src, '\0', nchars);
+
+if (0 == last) {
+// no NUL found in the initial portion of the source string
+// that fits into the local temporary buffer; copy as many
+// characters as fit into the buffer
+
+if (bufsize = nchars) {
+if (pbuf != buf)
+delete[] pbuf;
+pbuf = new char [nchars + 1];
+}
+
+psrc = pbuf;
+memcpy (psrc, src, nchars);
+
+// append a terminating NUL and decrement the number
+// of characters that remain to be processed
+psrc [nchars] = '\0';
+src  += nchars;
+nchars= 0;
+}
+else {
+// terminating NUL found in the source buffer
+nchars -= (last - src) + 1;
+psrc= _RWSTD_CONST_CAST (char*, src);
+src+= (last - src) + 1;
 }
 
-psrc = pbuf;
-memcpy (psrc, src, nchars);
+#ifdef _RWSTD_OS_SUNOS
+// Solaris 10u5 on AMD64 overwrites memory past the end of
+// just_in_case_buf[8], to avoid this, pass a null pointer
+char* const just_in_case_buf = 0;
+#else
+// provide a destination buffer to strxfrm() in case
+// it's buggy (such as MSVC's) and tries to write to
+// the buffer even if it's 0
+char just_in_case_buf [8];
+#endif // _RWSTD_OS_SUNOS
 
-// append a terminating NUL and decrement the number
-// of characters that remain to be processed
-psrc [nchars] = '\0';
-src  += nchars;
-nchars= 0;
-}
-else {
+const size_t dst_size = strxfrm (just_in_case_buf, psrc, 0);
 
-// terminating NUL found in the source buffer
-nchars -= (last - src) + 1;
-psrc= _RWSTD_CONST_CAST (char*, src);
-src+= (last - src) + 1;
-}
+// check for strxfrm() errors
+if (0 == (dst_size  1)) {
+if (pbuf != buf)
+delete[] pbuf;
 
-#ifdef _RWSTD_OS_SUNOS
-// Solaris 10u5 on AMD64 overwrites memory past the end of
-// just_in_case_buf[8], to avoid this, pass a null pointer
-char* const just_in_case_buf = 0;
-#else
-// provide a destination buffer to strxfrm() in case
-// it's buggy (such as MSVC's) and tries to write to
-// the buffer even if it's 0
-char just_in_case_buf [8];
-#endif
-
-const size_t dst_size = strxfrm (just_in_case_buf, psrc, 0);
-
-// check for strxfrm() errors
-if (0 == (dst_size  1

Re: [PATCH] Re: STDCXX-1072 SPARC V8 mutex alignment requirements

2012-10-13 Thread Liviu Nicoara

Thanks, Travis!

On 10/13/12 15:36, Travis Vitek wrote:


Liviu,

I built the library and tests before and after your change with the 64-bit 
flag, and I saw no differences in the number of failed tests between the 
builds. I've attached the output of 'gmake -k runall' before and after your 
change to STDCXX-1072 in case you want to look over them.

I wrote up the following test case to prove that I could reproduce the issue 
using the compile/link options used to build the stdcxx tests. The code is 
below..

   [vitek@andromeda] 138 % cat t.cpp
   #include thread.h
   #include synch.h
   #include malloc.h

   int main ()
   {
 int *ip;
 mutex_t *mp;

 ip = (int*)malloc(sizeof (int) + sizeof (mutex_t));
 mp = (mutex_t*)(ip + 1);

 mutex_init(mp, USYNC_THREAD | LOCK_ROBUST, 0);
 mutex_lock(mp);
 mutex_unlock(mp);
 mutex_destroy(mp);

 free(ip);
   }
   [vitek@andromeda] 139 % gmake t
   CC -c -D_RWSTDDEBUG -mt -I/amd/homes/vitek/tmp/stdcxx-4.2.x/include \
 -I/build/vitek/stdcxx-4.2.x_sunpro-5.8_sunos-5.10-patched/include \
 -I/amd/homes/vitek/tmp/stdcxx-4.2.x/tests/include \
 -library=%none -g  -m64 +w -errtags -erroff=hidef   t.cpp
   CC t.o -o t -L/build/vitek/stdcxx-4.2.x_sunpro-5.8_sunos-5.10-patched/rwtest 
\
 -lrwtest15S -library=%none -mt -m64 \
 -L/build/vitek/stdcxx-4.2.x_sunpro-5.8_sunos-5.10-patched/lib \
 -lstd15S  -lm
   Bus Error (core dumped)
   [vitek@andromeda] 140 %

I also tested with POSIX mutexes and saw the same behavior.

Travis


From: Liviu Nicoara
Sent: Thursday, October 11, 2012 5:28 AM
To: dev@stdcxx.apache.org
Subject: Re: [PATCH] Re: STDCXX-1072 SPARC V8 mutex alignment requirements

I applied the patch on 4.2.x. If someone with access to a SPARC machine
could give it a runall and post the results here it would be awesome. I
will postpone closing the incident until then.

Thanks!

Liviu







Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-12 Thread Liviu Nicoara

On 10/04/12 22:41, Liviu Nicoara wrote:

On 10/4/12 10:10 PM, Liviu Nicoara wrote:


On 10/3/12 11:10 AM, Martin Sebor wrote:

On 10/03/2012 07:01 AM, Liviu Nicoara wrote:



   void* thread_func (void*) {
   for (int i = 0; i  N; ++)
   test 1: do some simple stuff inline
   test 2: call a virtual function to do the same stuff
   test 3: lock and unlock a mutex and do the same stuff
   }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what simple stuff is (eventually, even when
it's getting numpunct::grouping() data).


That is expected; I attached test case x.cpp and results-x.txt.

I did not find it too interesting in its own, though. The difference
between the
cached and non-cached data is that in the case of the cached data the
copying of
the string involves nothing more than a bump in the reference counter,
whereas
in the non-cached version a string object is constructed anew, and
memory gets
allocated for its body. Yet, in my measurements, the cached version is
the one
which shows the worse performance.

So, I extracted the std::string class and simplified it down and put
it in
another test case. That would be u.cpp and the results are
results-u.txt. The
results show the same performance trends although the absolute values
have
skewed. Will get back to this after I digest the results a bit more.


This discussion has stagnated lately. I hope to allocate some time this 
week-end to go over the results, re-verify them and analyze the 
disassembly. In the meantime, any thoughts?


Liviu



Re: [PATCH] Re: STDCXX-1072 SPARC V8 mutex alignment requirements

2012-10-11 Thread Liviu Nicoara
I applied the patch on 4.2.x. If someone with access to a SPARC machine 
could give it a runall and post the results here it would be awesome. I 
will postpone closing the incident until then.


Thanks!

Liviu

On 10/06/12 16:56, Liviu Nicoara wrote:

On 09/29/12 15:33, Liviu Nicoara wrote:

On 9/28/12 11:32 AM, Travis Vitek wrote:


[...] I think an even cleaner solution is to switch
to using __rw_aligned_buffer instead. It gives us a single point of
failure for alignment issues like this, and it makes the code
self-documenting and easier to read.


I am attaching another patch here, which makes use of the
__rw_aligned_buffer, slightly more verbose but the code is slightly
cleaner.


If there are no objections, I would check it in sometime after this
week-end.

Liviu




[PATCH] STDCXX-970

2012-10-10 Thread Liviu Nicoara
The following patch cleans up most of the failures seen in the collation 
test. It does not fix the transform failures and libstd tests.


2012-10-10  Liviu Nicoara  lnico...@apache.org

Various fixes:

* tests/localization/22.locale.collate.cpp: removed unused
  macros, corrected assertions and their formatting. (c_xfrm):
  corrected use of safety buffers. (make_test_locale): used
  rw_create_locale, used the whole portable character
  set. (check_libc): split code, restored previous locales after
  test. (check_NUL): hard-coded positions of NUL characters.
Index: tests/localization/22.locale.collate.cpp
===
--- tests/localization/22.locale.collate.cpp(revision 1392832)
+++ tests/localization/22.locale.collate.cpp(working copy)
@@ -43,6 +43,9 @@
 #include rw_locale.h
 #include rw_process.h
 
+#define IGNORE0
+#define STR_SIZE  16
+#define LOCNAME_SIZE  256
 
 #if _RWSTD_PATH_SEP == '/'
 #  define SLASH /
@@ -59,10 +62,6 @@
 #define LOCALE_ROOT RWSTD_LOCALE_ROOT
 const char* locale_root;
 
-#define LC_COLLATE_SRC  LC_COLLATE.src
-#define LC_COLLATE_CM   LC_COLLATE.cm
-#define TEST_LOCALE_NAMEtest.locale
-
 /**/
 
 // These overloads are necessary in our template
@@ -77,18 +76,18 @@
 
 std::size_t c_xfrm (char* to, const char* from, std::size_t size)
 {
-char safety_buf [8];
+char safety_buf [8] = { 0 };
+
 if (0 == to  0 == size) {
 // prevent buggy implementations (such as MSVC 8) from trying
 // to write to the destination buffer even though it's 0 and
 // its size is zero (see stdcxx-69)
 to = safety_buf;
-*to = '\0';
 }
 
 std::size_t n = std::strxfrm (to, from, size);
 
-if (to)
+if (to  to != safety_buf)
 n = std::strlen (to);
 
 return n;
@@ -127,45 +126,42 @@
 
 std::size_t c_xfrm (wchar_t* to, const wchar_t* from, std::size_t size)
 {
+std::size_t n = 0;
+
 #if !defined (_MSC_VER) || _MSC_VER  1200
 
-wchar_t safety_buf [8];
+wchar_t safety_buf [8] = { 0 };
+
 if (0 == to  0 == size) {
 // prevent buggy implementations (such as MSVC 8) from trying
 // to write to the destination buffer even though it's 0 and
 // its size is zero (see stdcxx-69)
 to = safety_buf;
-*to = L'\0';
 }
 
-std::size_t n = std::wcsxfrm (to, from, size);
+n = std::wcsxfrm (to, from, size);
 
-if (to)
+if (to  to != safety_buf)
 n = std::wcslen (to);
 
 #else   // MSVC 6 and prior
 
 // working around an MSVC 6.0 libc bug (PR #26437)
-
 if (to) {
-std::size_t n = std::wcsxfrm (to, from, size);
-
+std::wcsxfrm (to, from, size);
 n = std::wcslen (to);
-
-return n;
 }
+else {
+wchar_t tmp [1024];
 
-wchar_t tmp [1024];
-
-std::size_t n = std::wcslen (from);
-
-_RWSTD_ASSERT (n  sizeof tmp / sizeof *tmp);
-
-std::wcscpy (tmp, from);
-
-std::wcsxfrm (tmp, from, sizeof tmp / sizeof *tmp);
-
-n = std::wcslen (tmp);
+n = std::wcslen (from);
+_RWSTD_ASSERT (n  sizeof tmp / sizeof *tmp);
+
+std::wcscpy (tmp, from);
+std::wcsxfrm (tmp, from, sizeof tmp / sizeof *tmp);
+
+n = std::wcslen (tmp);
+}
 
 #endif   // MSVC 6
 
@@ -226,7 +222,7 @@
 /**/
 
 template class charT
-/*static*/ void
+void
 gen_str (charT* str, std::size_t size)
 {
 // generate a random string with the given size
@@ -234,7 +230,7 @@
 return;
 
 // use ASCII characters in the printable range
-for (std::size_t i = 0; i != size - 1; ++i)
+for (std::size_t i = 0; i  size - 1; ++i)
 str [i] = ' ' + std::rand () % ('~' - ' ');
 
 str [size - 1] = charT ();
@@ -243,152 +239,165 @@
 /**/
 
 template class charT
-/*static*/ void
-check_libc (const char* charTname)
+void
+check_libc_locale (const char* charTname, char const* locname,
+   int (nfail) [3])
 {
-// the libc implementation of the library should act the same as
-// the c-library.  Go through all the locales, generate some random
-// strings and make sure that the following holds true:
-// transform acts like strxfrm and wcsxfrm,
-// compare acts like strcoll and wcscoll
+typedef std::char_traitscharT traits_type;
+typedef std::allocatorcharT allocator_type;
+typedef std::basic_string charT, traits_type, allocator_type string_type;
 
-int nfail [3] = { 0 };
+std::locale loc (locname);
 
-rw_info (0, __FILE__, __LINE__,
- libc std::collate%s::transform (), charTname);
+const std::collatecharT co =
+_STD_USE_FACET (std::collatecharT, loc

[PATCH] STDCXX-1073

2012-10-10 Thread Liviu Nicoara

2012-10-10  Liviu Nicoara  lnico...@apache.org

* src/collate.cpp (__rw_strnxfrm): preserved embedded NULs

Index: src/collate.cpp
===
--- src/collate.cpp (revision 1392832)
+++ src/collate.cpp (working copy)
@@ -547,7 +547,7 @@
 
 _TRY {
 // resize the result string to fit itself plus the result
-// of the transformation including the terminatin NUL
+// of the transformation including the terminating NUL
 // appended by strxfrm()
 res.resize (res_size + dst_size + 1);
 }
@@ -557,36 +557,15 @@
 _RETHROW;
 }
 
-// transfor the source string up to the terminating NUL
-size_t xfrm_size =
-strxfrm (res [0] + res_size, psrc, dst_size + 1);
-
-#if defined _MSC_VER  _MSC_VER  1400
-// compute the correct value that should have been returned from
-// strxfrm() after the transformation has completed (MSVC strxfrm()
-// returns a bogus result; see PR #29935)
-xfrm_size = strlen (res [0] + res_size);
-#endif   // MSVC  8.0
-
-// increment the size of the result string by the number
-// of transformed characters excluding the terminating NUL
-// if strxfrm() transforms the empty string into the empty
-// string, keep the terminating NUL, otherwise drop it
-res_size += xfrm_size + (last  !*psrc  !xfrm_size);
-
-_TRY {
-res.resize (res_size);
-}
-_CATCH (...) {
-if (pbuf != buf)
-delete[] pbuf;
-_RETHROW;
-}
+strxfrm (res [0] + res_size, psrc, dst_size + 1);
 }
 
 if (pbuf != buf)
 delete[] pbuf;
 
+if (!res.empty ())
+res.resize (res.size () - 1);
+
 return res;
 }
 


Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-04 Thread Liviu Nicoara


On 10/3/12 11:10 AM, Martin Sebor wrote:

On 10/03/2012 07:01 AM, Liviu Nicoara wrote:


I am gathering some more measurements along these lines but it's time
consuming. I estimate I will have some ready for review later today or
tomorrow. In the meantime could you please post your kernel, glibc and
compiler versions?


I was just thinking of a few simple loops along the lines of:

   void* thread_func (void*) {
   for (int i = 0; i  N; ++)
   test 1: do some simple stuff inline
   test 2: call a virtual function to do the same stuff
   test 3: lock and unlock a mutex and do the same stuff
   }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what simple stuff is (eventually, even when
it's getting numpunct::grouping() data).


That is expected; I attached test case x.cpp and results-x.txt.

I did not find it too interesting in its own, though. The difference between the 
cached and non-cached data is that in the case of the cached data the copying of 
the string involves nothing more than a bump in the reference counter, whereas 
in the non-cached version a string object is constructed anew, and memory gets 
allocated for its body. Yet, in my measurements, the cached version is the one 
which shows the worse performance.


So, I extracted the std::string class and simplified it down and put it in 
another test case. That would be u.cpp and the results are results-u.txt. The 
results show the same performance trends although the absolute values have 
skewed. Will get back to this after I digest the results a bit more.


Liviu
-*- mode: org -*-

* iMac, 4x Core i5 , 12S, gcc 4.5.4:

$ nice make u

16, 1   1m18.811s   5m12.329s   0m0.263s
8, 10m39.919s   2m36.198s   0m0.150s
4, 10m20.449s   1m20.797s   0m0.050s
2, 10m9.888s0m19.725s   0m0.005s
1, 10m2.483s0m2.480s0m0.002s

$ nice make CPPOPTS=-DNO_CACHE u

16, 1   0m37.418s   2m27.822s   0m0.872s
8, 10m18.844s   1m14.607s   0m0.261s
4, 10m10.165s   0m40.147s   0m0.023s
2, 10m8.652s0m17.278s   0m0.003s
1, 10m8.482s0m8.473s0m0.007s

$ nice make CPPOPTS=-DNO_VIRTUAL_CALL u

16, 1   1m2.770s4m9.307s0m0.179s
8, 10m31.890s   2m6.792s0m0.087s
4, 10m16.427s   1m5.133s0m0.039s
2, 10m8.497s0m16.981s   0m0.007s
1, 10m2.291s0m2.288s0m0.002s

$ nice make CPPOPTS=-DNO_CACHE -DNO_VIRTUAL_CALL u

16, 1   0m35.838s   2m21.406s   0m0.877s
8, 10m19.007s   1m14.920s   0m0.255s
4, 10m10.099s   0m39.504s   0m0.042s
2, 10m8.599s0m17.190s   0m0.003s
1, 10m8.986s0m8.980s0m0.005s

* Linux Slackware, 16x AMD Opteron, 12S, gcc 4.5.2

$ nice make u



$ nice make CPPOPTS=-DNO_CACHE u


$ nice make CPPOPTS=-DNO_VIRTUAL_CALL u

$ nice make CPPOPTS=-DNO_CACHE -DNO_VIRTUAL_CALL u

-*- mode: org -*-

* iMac, 4x Core i5 , 12S, gcc 4.5.4:

$ nice make CPPOPTS=-DNO_LOCK -DNO_VIRTUAL_CALL u

16, 1   0m7.864s0m30.259s   0m0.035s
 8, 1   0m4.396s0m17.034s   0m0.016s
 4, 1   0m2.729s0m10.473s   0m0.011s
 2, 1   0m2.481s0m4.929s0m0.003s
 1, 1   0m2.461s0m2.455s0m0.002s

$ nice make CPPOPTS=-DNO_LOCK u

16, 1   0m9.724s0m37.455s   0m0.043s
 8, 1   0m5.559s0m20.309s   0m0.048s
 4, 1   0m3.160s0m12.213s   0m0.013s
 2, 1   0m2.872s0m5.694s0m0.004s
 1, 1   0m2.845s0m2.838s0m0.002s

$ nice make u

16, 1   1m3.745s3m58.570s   0m0.351s
 8, 1   0m32.351s   1m55.740s   0m0.203s
 4, 1   0m16.852s   1m1.633s0m0.092s
 2, 1   0m8.419s0m16.699s   0m0.010s
 1, 1   0m4.214s0m4.179s0m0.005s

* Linux Slackware, 16x AMD Opteron, 12S, gcc 4.7.1

$ nice make CPPOPTS=-DNO_LOCK -DNO_VIRTUAL_CALL u

16, 1   0m4.382s1m9.896s0m0.004s
 8, 1   0m4.374s0m34.904s   0m0.002s
 4, 1   0m4.368s0m17.445s   0m0.002s
 2, 1   0m4.366s0m8.720s0m0.003s
 1, 1   0m4.355s0m4.351s0m0.001s

$ nice make CPPOPTS=-DNO_LOCK u

16, 1   0m5.415s1m19.833s   0m0.005s
 8, 1   0m4.939s0m39.438s   0m0.003s
 4, 1   0m4.936s0m19.712s   0m0.001s
 2, 1   0m4.930s0m9.847s0m0.002s
 1, 1   0m4.921s0m4.917s0m0.002s

$ nice make u

16, 1   1m40.769s   24m17.198s  0m0.006s
 8, 1   0m51.702s   6m15.400s

Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-04 Thread Liviu Nicoara

On 10/4/12 10:10 PM, Liviu Nicoara wrote:


On 10/3/12 11:10 AM, Martin Sebor wrote:

On 10/03/2012 07:01 AM, Liviu Nicoara wrote:


I am gathering some more measurements along these lines but it's time
consuming. I estimate I will have some ready for review later today or
tomorrow. In the meantime could you please post your kernel, glibc and
compiler versions?


I was just thinking of a few simple loops along the lines of:

   void* thread_func (void*) {
   for (int i = 0; i  N; ++)
   test 1: do some simple stuff inline
   test 2: call a virtual function to do the same stuff
   test 3: lock and unlock a mutex and do the same stuff
   }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what simple stuff is (eventually, even when
it's getting numpunct::grouping() data).


That is expected; I attached test case x.cpp and results-x.txt.

I did not find it too interesting in its own, though. The difference between the
cached and non-cached data is that in the case of the cached data the copying of
the string involves nothing more than a bump in the reference counter, whereas
in the non-cached version a string object is constructed anew, and memory gets
allocated for its body. Yet, in my measurements, the cached version is the one
which shows the worse performance.

So, I extracted the std::string class and simplified it down and put it in
another test case. That would be u.cpp and the results are results-u.txt. The
results show the same performance trends although the absolute values have
skewed. Will get back to this after I digest the results a bit more.


Attached some incomplete files previously. Corrected.

Liviu
-*- mode: org -*-

* iMac, 4x Core i5 , 12S, gcc 4.5.4:

$ nice make u

16, 1   1m18.811s   5m12.329s   0m0.263s
8, 10m39.919s   2m36.198s   0m0.150s
4, 10m20.449s   1m20.797s   0m0.050s
2, 10m9.888s0m19.725s   0m0.005s
1, 10m2.483s0m2.480s0m0.002s

$ nice make CPPOPTS=-DNO_CACHE u

16, 1   0m37.418s   2m27.822s   0m0.872s
8, 10m18.844s   1m14.607s   0m0.261s
4, 10m10.165s   0m40.147s   0m0.023s
2, 10m8.652s0m17.278s   0m0.003s
1, 10m8.482s0m8.473s0m0.007s

$ nice make CPPOPTS=-DNO_VIRTUAL_CALL u

16, 1   1m2.770s4m9.307s0m0.179s
8, 10m31.890s   2m6.792s0m0.087s
4, 10m16.427s   1m5.133s0m0.039s
2, 10m8.497s0m16.981s   0m0.007s
1, 10m2.291s0m2.288s0m0.002s

$ nice make CPPOPTS=-DNO_CACHE -DNO_VIRTUAL_CALL u

16, 1   0m35.838s   2m21.406s   0m0.877s
8, 10m19.007s   1m14.920s   0m0.255s
4, 10m10.099s   0m39.504s   0m0.042s
2, 10m8.599s0m17.190s   0m0.003s
1, 10m8.986s0m8.980s0m0.005s

* Linux Slackware, 16x AMD Opteron, 12S, gcc 4.5.2

$ nice make u

16, 1   2m7.644s30m10.014s  0m0.006s
 8, 1   1m4.449s7m45.111s   0m0.004s
 4, 1   0m31.646s   2m1.438s0m0.002s
 2, 1   0m14.828s   0m29.637s   0m0.002s
 1, 1   0m3.223s0m3.220s0m0.002s


$ nice make CPPOPTS=-DNO_CACHE u

16, 1   0m10.730s   2m46.032s   0m2.204s
 8, 1   0m9.922s1m18.892s   0m0.008s
 4, 1   0m9.874s0m39.192s   0m0.003s
 2, 1   0m9.652s0m19.268s   0m0.004s
 1, 1   0m9.622s0m9.617s0m0.002s


$ nice make CPPOPTS=-DNO_VIRTUAL_CALL u

16, 1   1m54.812s   27m5.957s   0m0.007s
 8, 1   0m53.862s   6m26.391s   0m0.003s
 4, 1   0m26.287s   1m39.991s   0m0.001s
 2, 1   0m9.120s0m17.233s   0m0.002s
 1, 1   0m3.004s0m3.002s0m0.001s


$ nice make CPPOPTS=-DNO_CACHE -DNO_VIRTUAL_CALL u

16, 1   0m9.717s2m34.412s   0m0.401s
 8, 1   0m9.613s1m16.762s   0m0.007s
 4, 1   0m9.590s0m38.285s   0m0.004s
 2, 1   0m9.629s0m19.165s   0m0.002s
 1, 1   0m9.361s0m9.357s0m0.001s
-*- mode: org -*-

* iMac, 4x Core i5 , 12S, gcc 4.5.4:

$ nice make CPPOPTS=-DNO_LOCK -DNO_VIRTUAL_CALL u

16, 1   0m7.864s0m30.259s   0m0.035s
 8, 1   0m4.396s0m17.034s   0m0.016s
 4, 1   0m2.729s0m10.473s   0m0.011s
 2, 1   0m2.481s0m4.929s0m0.003s
 1, 1   0m2.461s0m2.455s0m0.002s

$ nice make CPPOPTS=-DNO_LOCK u

16, 1   0m9.724s0m37.455s   0m0.043s
 8, 1   0m5.559s0m20.309s   0m0.048s
 4, 1   0m3.160s0m12.213s   0m0.013s
 2

Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-03 Thread Liviu Nicoara

On 10/02/12 10:41, Martin Sebor wrote:

I haven't had time to look at this since my last email on
Sunday. I also forgot about the string mutex. I don't think
I'll have time to spend on this until later in the week.
Unless the disassembly reveals the smoking gun, I think we
might need to simplify the test to get to the bottom of the
differences in our measurements. (I.e., eliminate the library
and measure the runtime of a simple thread loop, with and
without locking.) We should also look at the GLIBC and
kernel versions on our systems, on the off chance that
there has been a change that could explain the discrepancy
between my numbers and yours. I suspect my system (RHEL 4.8)
is much older than yours (I don't remember now if you posted
your details).


I am gathering some more measurements along these lines but it's time 
consuming. I estimate I will have some ready for review later today or 
tomorrow. In the meantime could you please post your kernel, glibc and 
compiler versions?


Liviu



Martin

On 10/02/2012 06:22 AM, Liviu Nicoara wrote:

On 09/30/12 18:18, Martin Sebor wrote:

I see you did a 64-bit build while I did a 32-bit one. so
I tried 64-bits. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?


The first thing that struck me in the cached `f' was that __string_ref
class uses a mutex for synchronizing access to the ref counter. It turns
out, for Linux on AMD64 we explicitly use a mutex instead of the atomic
ops on the ref counter, via a block in rw/_config.h:

# if _RWSTD_VER_MAJOR  5
# ifdef _RWSTD_OS_LINUX
// on Linux/AMD64, unless explicitly requested, disable the use
// of atomic operations in string for binary compatibility with
// stdcxx 4.1.x
# ifndef _RWSTD_USE_STRING_ATOMIC_OPS
# define _RWSTD_NO_STRING_ATOMIC_OPS
# endif // _RWSTD_USE_STRING_ATOMIC_OPS
# endif // _WIN32
# endif // stdcxx  5.0


That is not the cause for the performance difference, though. Even after
building with __RWSTD_USE_STRING_ATOMIC_OPS I get the same better
performance with the non-cached version.

Liviu




Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-10-02 Thread Liviu Nicoara

On 09/30/12 18:18, Martin Sebor wrote:

I see you did a 64-bit build while I did a 32-bit one. so
I tried 64-bits. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?


The first thing that struck me in the cached `f' was that __string_ref 
class uses a mutex for synchronizing access to the ref counter. It turns 
out, for Linux on AMD64 we explicitly use a mutex instead of the atomic 
ops on the ref counter, via a block in rw/_config.h:


#  if _RWSTD_VER_MAJOR  5
#ifdef _RWSTD_OS_LINUX
   // on Linux/AMD64, unless explicitly requested, disable the use
   // of atomic operations in string for binary compatibility with
   // stdcxx 4.1.x
#  ifndef _RWSTD_USE_STRING_ATOMIC_OPS
#define _RWSTD_NO_STRING_ATOMIC_OPS
#  endif   // _RWSTD_USE_STRING_ATOMIC_OPS
#endif   // _WIN32
#  endif   // stdcxx  5.0


That is not the cause for the performance difference, though. Even after 
building with __RWSTD_USE_STRING_ATOMIC_OPS I get the same better 
performance with the non-cached version.


Liviu


Fwd: Re: STDCXX-1071 numpunct facet defect

2012-09-30 Thread Liviu Nicoara

Forwarding with the attachment.

 Original Message 
Subject: Re: STDCXX-1071 numpunct facet defect
Date: Sun, 30 Sep 2012 12:09:10 -0600
From: Martin Sebor mse...@gmail.com
To: Liviu Nicoara nikko...@hates.ms

On 09/27/2012 06:36 PM, Liviu Nicoara wrote:

On 9/27/12 8:27 PM, Martin Sebor wrote:

On 09/27/2012 06:41 AM, Liviu Nicoara wrote:

On 09/26/12 20:12, Liviu Nicoara wrote:

I have created STDCXX-1071 and linked to STDCXX-1056. [...]

I am open to all questions, the more the better. Most of my opinions
have been expressed earlier, but please ask if you want to know more.



I am attaching here the proposed (4.3.x) patch and the timings results
(after re-verifying the correctness of the timing program and the
results). The 4.2.x patch, the 4.3.x patch, the test program and the
results file are also attached to the incident.


The patch isn't binary compatible. We can't remove data members
in a minor release. We could only do it in a major release.


There are two of them, one for 4.2.x, one for 4.3.x.



I'm still curious about the performance, though. It doesn't make
sense to me that a call to a virtual function is faster than one
to an almost trivial inline function.


I scratched my head long on that. I wager that it depends on the
machine, too.


Here are my timings for library-reduction.cpp when compiled
GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small
number of trivial changes to get it to compile:

 With cache   No cache
real1m38.332s 8m58.568s
user6m30.244s34m25.942s
sys 0m0.060s  0m3.922s

I also experimented with the program on Linux (CEL 4 with 16
CPUs). Initially, I saw no differences between the two versions.
So I modified it a bit to make it closer to the library (the
modified program is attached). With those changes the timings
are below:

 With cache   No cache
real0m 1.107s0m 5.669s
user0m17.204s0m 5.669s
sys 0m 0.000s0m22.347s

I also recompiled and re-ran the test on Solaris. To speed
things along, I set the number threads and loops to 8 and
100. The numbers are as follows:

 With cache   No cache
real0m3.341s 0m26.333s
user0m13.052s1m37.470s
sys 0m0.009s 0m0.132s

The numbers match my expectation. The overhead without the
numpunct cache is considerable.

Somewhat unexpectedly, the test with the cache didn't crash.

I also tried timing the numpunct patch attached to the issue
with the test program. The test program runs to completion
without the patch but it crashes with a SIGSEGV after the
patch is applied. That suggests there's a bug somewhere in
do_thousands_sep() (in addition to the caching).

Martin



Let me know if there is anything I could help with.

Liviu





#include iostream
#include locale

#include cstdio
#include cstdlib
#include cstring

#include pthread.h
#include unistd.h

#define MAX_THREADS 128

static long nloops = 1000, nthreads = 16;
static bool volatile pwait = true;



namespace X {

struct guard 
{
guard (pthread_mutex_t* ptr) : lock_ (ptr) { 
pthread_mutex_lock (lock_);
}

~guard () { 
pthread_mutex_unlock (lock_); 
}

private:

guard (guard const);
guard operator= (guard const);

private:

pthread_mutex_t* lock_;
};

struct facet
{
facet () {
pthread_mutex_init (_C_mutex, 0);
}

void* _C_data () {
return _C_impsize ? _C_impdata : _C_get_data ();
}

void* _C_get_data ();

size_t _C_impsize;
void*  _C_impdata;

pthread_mutex_t _C_mutex;
};

void* facet::_C_get_data ()
{
if (_C_impsize)
return _C_impdata;

guard g (_C_mutex);

if (_C_impsize)
return _C_impdata;

#if !defined (NO_USE_STDCXX_LOCALES)
_C_impdata = (void*)\3\3;
#endif // USE_STDCXX_LOCALES
_C_impsize = (size_t)(-1);

return _C_impdata;
}

void* get_data (facet*);

struct numpunct : facet
{
numpunct () : _C_flags () { }

virtual std::string do_grouping () const;

std::string grouping () const {
numpunct* const __self = (numpunct*)this;

#if !defined (NO_USE_NUMPUNCT_CACHE)
if (!(_C_flags  1)) {
__self-_C_flags  |= 1;
__self-_C_grouping = do_grouping ();
}

return _C_grouping;
#else
return do_grouping ();
#endif // NO_USE_NUMPUNCT_CACHE
}

private:

int  _C_flags;
std::string  _C_grouping;
};

std::string numpunct::do_grouping () const
{
return (const char*)get_data (const_castnumpunct*(this));
}

void* get_data (facet* pfacet)
{
void* pdata = pfacet-_C_data ();

if (pdata) 
return pdata;

// __rw_setlocale loc (...); - other mutex

if (pfacet-_C_data ())
return get_data (pfacet);

pfacet-_C_impdata = const_castchar*(\4\4);
pfacet-_C_impsize = (size_t)(-1);

return 0;
}

} // namespace X

Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-09-30 Thread Liviu Nicoara

On 9/30/12 2:21 PM, Liviu Nicoara wrote:

Forwarding with the attachment.

 Original Message 
Subject: Re: STDCXX-1071 numpunct facet defect
Date: Sun, 30 Sep 2012 12:09:10 -0600
From: Martin Sebor mse...@gmail.com
To: Liviu Nicoara nikko...@hates.ms


On 9/27/12 8:27 PM, Martin Sebor wrote:


Here are my timings for library-reduction.cpp when compiled
GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small
number of trivial changes to get it to compile:

  With cache   No cache
real1m38.332s 8m58.568s
user6m30.244s34m25.942s
sys 0m0.060s  0m3.922s

I also experimented with the program on Linux (CEL 4 with 16
CPUs). Initially, I saw no differences between the two versions.
So I modified it a bit to make it closer to the library (the
modified program is attached). With those changes the timings


I see the difference -- your program has a virtual function it calls from the 
inline grouping function.



are below:

  With cache   No cache
real0m 1.107s0m 5.669s
user0m17.204s0m 5.669s
sys0m 0.000s0m22.347s

I also recompiled and re-ran the test on Solaris. To speed
things along, I set the number threads and loops to 8 and
100. The numbers are as follows:

  With cache   No cache
real0m3.341s 0m26.333s
user0m13.052s1m37.470s
sys 0m0.009s 0m0.132s

The numbers match my expectation. The overhead without the
numpunct cache is considerable.


I have done another (smaller) round of measurements, this time using the test 
program you posted. Here are the results:


* iMac, 4x Intel, 12S:

16, 1000:

Cached   Not cached
real0m9.300s 0m5.224s
user0m36.441s0m20.523s
sys 0m0.043s 0m0.068s

* iMac, 4x Intel, 12D:

CachedNot cached
real0m9.012s  0m5.774s
user0m35.343s 0m20.997s
sys 0m0.045s  0m0.183s

* Linux Slackware, 16x AMD Opteron, 12S:

16, 1000:

Cached Not cached
real0m29.798s  0m3.278s
user0m48.662s  0m47.338s
sys 6m18.525s  0m3.298s



Somewhat unexpectedly, the test with the cache didn't crash.


On my iMac it did not crash for me either (gcc 4.5.4), this time. On the other 
box (gcc 4.5.2) crashed every time with caching, so I had to add a call to 
fac.grouping outside the thread function to initialize the facet.


Liviu


Fwd: Re: Fwd: Re: STDCXX-1071 numpunct facet defect

2012-09-30 Thread Liviu Nicoara

Forwarding to the list. Duh.


 Original Message 
Subject: Re: Fwd: Re: STDCXX-1071 numpunct facet defect
Date: Sun, 30 Sep 2012 19:02:27 -0400
From: Liviu Nicoara nikko...@hates.ms
To: Martin Sebor mse...@gmail.com

On 9/30/12 6:18 PM, Martin Sebor wrote:

I see you did a 64-bit build while I did a 32-bit one. so
I tried 64-bits. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?



Here they are.

Liviu




Dump of assembler code for function f:
   0x00403870 +0: push   %r15
   0x00403872 +2: push   %r14
   0x00403874 +4: push   %r13
   0x00403876 +6: push   %r12
   0x00403878 +8: push   %rbp
   0x00403879 +9: push   %rbx
   0x0040387a +10:mov%rdi,%rbx
   0x0040387d +13:sub$0x38,%rsp
   0x00403881 +17:nopl   0x0(%rax)
   0x00403888 +24:movzbl 0x261f11(%rip),%eax# 0x6657a0 
_ZL5pwait
   0x0040388f +31:test   %al,%al
   0x00403891 +33:jne0x403888 f+24
   0x00403893 +35:cmpq   $0x0,0x261ef5(%rip)# 0x665790 
_ZL6nloops
   0x0040389b +43:jle0x403b12 f+674
   0x004038a1 +49:xor%ebp,%ebp
   0x004038a3 +51:xor%r12d,%r12d
   0x004038a6 +54:lea0x10(%rsp),%r13
   0x004038ab +59:lea0x48(%rbx),%r14
   0x004038af +63:lea0x20(%rsp),%r15
   0x004038b4 +68:jmpq   0x4039a7 f+311
   0x004038b9 +73:nopl   0x0(%rax)
   0x004038c0 +80:cmp$0x66a020,%rdx
   0x004038c7 +87:mov%rdi,0x20(%rsp)
   0x004038cc +92:je 0x403ab8 f+584
   0x004038d2 +98:mov%rdx,%rdi
   0x004038d5 +101:   mov%rdx,(%rsp)
   0x004038d9 +105:   callq  0x403658 pthread_mutex_lock@plt
   0x004038de +110:   test   %eax,%eax
   0x004038e0 +112:   mov(%rsp),%rdx
   0x004038e4 +116:   je 0x4038fb f+139
   0x004038e6 +118:   mov$0x4452a4,%esi
   0x004038eb +123:   mov$0xa,%edi
   0x004038f0 +128:   xor%eax,%eax
   0x004038f2 +130:   callq  0x404370 _ZN4__rw10__rw_throwEiz
   0x004038f7 +135:   mov(%rsp),%rdx
   0x004038fb +139:   addl   $0x1,0x28(%rdx)
   0x004038ff +143:   test   %rdx,%rdx
   0x00403902 +146:   je 0x40390c f+156
   0x00403904 +148:   mov%rdx,%rdi
   0x00403907 +151:   callq  0x4036c8 pthread_mutex_unlock@plt
   0x0040390c +156:   mov0x20(%rsp),%rdx
   0x00403911 +161:   mov%rdx,%rdi
   0x00403914 +164:   mov%rdx,(%rsp)
   0x00403918 +168:   callq  0x403258 strlen@plt
   0x0040391d +173:   mov(%rsp),%rdx
   0x00403921 +177:   add%rax,%r12
   0x00403924 +180:   lea-0x40(%rdx),%rcx
   0x00403928 +184:   cmp$0x66a020,%rcx
   0x0040392f +191:   je 0x40398b f+283
   0x00403931 +193:   mov%rcx,%rdi
   0x00403934 +196:   mov%rcx,0x8(%rsp)
   0x00403939 +201:   callq  0x403658 pthread_mutex_lock@plt
   0x0040393e +206:   test   %eax,%eax
   0x00403940 +208:   mov(%rsp),%rdx
   0x00403944 +212:   mov0x8(%rsp),%rcx
   0x00403949 +217:   je 0x403965 f+245
   0x0040394b +219:   mov$0x4452a4,%esi
   0x00403950 +224:   mov$0xa,%edi
   0x00403955 +229:   xor%eax,%eax
   0x00403957 +231:   callq  0x404370 _ZN4__rw10__rw_throwEiz
   0x0040395c +236:   mov0x8(%rsp),%rcx
   0x00403961 +241:   mov(%rsp),%rdx
   0x00403965 +245:   mov-0x18(%rdx),%esi
   0x00403968 +248:   test   %rcx,%rcx
   0x0040396b +251:   lea-0x1(%rsi),%eax
   0x0040396e +254:   mov%eax,-0x18(%rdx)
   0x00403971 +257:   je 0x403983 f+275
   0x00403973 +259:   mov%rcx,%rdi
   0x00403976 +262:   mov%esi,0x8(%rsp)
   0x0040397a +266:   callq  0x4036c8 pthread_mutex_unlock@plt
   0x0040397f +271:   mov0x8(%rsp),%esi
   0x00403983 +275:   test   %esi,%esi
   0x00403985 +277:   jle0x403a80 f+528
   0x0040398b +283:   add$0x1,%ebp
   0x0040398e +286:   movq   $0x0,0x20(%rsp

[PATCH] Re: STDCXX-1072 SPARC V8 mutex alignment requirements

2012-09-29 Thread Liviu Nicoara

On 9/28/12 11:32 AM, Travis Vitek wrote:



-Original Message-
From: Liviu Nicoara
Sent: Friday, September 28, 2012 5:29 AM
[...]

The patch assumes the type `long double' exists on every platform. While I do 
believe that it is available everywhere, we have lots of conditional code 
guarding its use in the library now. If we are going to use `long double' in 
this context, we should guard it with _RWSTD_NO_LONG_DOUBLE. I think an even 
cleaner solution is to switch to using __rw_aligned_buffer instead. It gives us 
a single point of failure for alignment issues like this, and it makes the code 
self-documenting and easier to read.


I am attaching another patch here, which makes use of the __rw_aligned_buffer, 
slightly more verbose but the code is slightly cleaner.


Thanks!

Liviu


Index: src/messages.cpp
===
--- src/messages.cpp(revision 1390953)
+++ src/messages.cpp(working copy)
@@ -38,6 +38,7 @@
 
 #include rw/_mutex.h
 
+#include podarray.h
 
 #if !defined (_RWSTD_NO_NL_TYPES_H)  !defined (_RWSTD_NO_CATOPEN_CATGETS)
 #  include nl_types.h
@@ -64,10 +65,7 @@ _RWSTD_NAMESPACE (__rw) {
 struct __rw_open_cat_data
 {
 nl_catd catd;
-union {
-void *_C_align;
-char _C_data [sizeof (_STD::locale)];
-} loc;
+__rw_aligned_buffer_STD::locale loc;
 };
 
 struct __rw_open_cat_page
@@ -159,7 +157,8 @@ __rw_manage_cat_data (int cat,  __rw_op
 cat = int (n_catalogs);
 
 catalogs [cat]-catd = pcat_data-catd;
-memcpy (catalogs [cat]-loc, pcat_data-loc,
+memcpy (catalogs [cat]-loc._C_store (), 
+pcat_data-loc._C_store (),
 sizeof (_STD::locale));
 
 if (size_t (cat)  largest_cat)
@@ -175,7 +174,8 @@ __rw_manage_cat_data (int cat,  __rw_op
 }
 
 catalogs [cat]-catd = pcat_data-catd;
-memcpy (catalogs [cat]-loc, pcat_data-loc,
+memcpy (catalogs [cat]-loc._C_store (),
+pcat_data-loc._C_store (),
 sizeof (_STD::locale));
 
 if (size_t (cat)  largest_cat)
@@ -258,8 +258,9 @@ int __rw_cat_open (const _STD::string c
 return -1;
 
 __rw_open_cat_data cat_data;
+
 cat_data.catd = catd;
-new (cat_data.loc) _STD::locale (loc);
+new (cat_data.loc._C_store ()) _STD::locale (loc);
 
 int cat = -1;
 __rw_manage_cat_data (cat, cat_data);
@@ -307,7 +308,7 @@ const _STD::locale __rw_get_locale (int
 
 _RWSTD_ASSERT (0 != pcat_data);
 
-return *(_RWSTD_REINTERPRET_CAST (_STD::locale*, (pcat_data-loc)));
+return *pcat_data-loc._C_data ();
 }
 
 
@@ -329,8 +330,7 @@ void __rw_cat_close (int cat)
 
 catclose (pcat_data-catd);
 
-_STD::locale* const ploc =
-_RWSTD_REINTERPRET_CAST (_STD::locale*, pcat_data-loc);
+_STD::locale* const ploc = pcat_data-loc._C_data ();
 
 ploc-~locale ();
 
Index: src/locale_body.cpp
===
--- src/locale_body.cpp (revision 1390953)
+++ src/locale_body.cpp (working copy)
@@ -1024,14 +1024,12 @@ _C_manage (__rw_locale *plocale, const c
 
 // the classic C locale is statically allocated
 // and not destroyed during the lifetime of the process
-static union {
-void* _C_align;
-char  _C_buf [sizeof (__rw_locale)];
-} classic_body;
+static __rw_aligned_buffer__rw_locale classic_body;
 
 // construct a locale body in place
 // with the initial reference count of 1
-classic = new (classic_body) __rw_locale (locname);
+classic = new (classic_body._C_store ()) 
+__rw_locale (locname);
 
 _RWSTD_ASSERT (1 == classic-_C_ref);
 }
Index: src/use_facet.h
===
--- src/use_facet.h (revision 1390953)
+++ src/use_facet.h (working copy)
@@ -37,6 +37,7 @@
 #include rw/_defs.h
 
 #include access.h
+#include podarray.h
 
 
 // helper macro _RWSTD_DEFINE_FACET_FACTORY() defines a facet factory
@@ -63,12 +64,9 @@
}   \
else {  \
/* construct an ordinary facet in static memory */  \
-   static union {  \
-   void *align_;   \
-   char  data_ [sizeof (__rw_ ## fid ## _facet)];  \
-   } f

STDCXX-970 and locale tests

2012-09-29 Thread Liviu Nicoara
While checking for fallout from 1072 I stumbled upon the collate test and its 
numerous (unrelated to 1072) failures. While I am looking at it I have a 
question: it seems to me that the locale tests do not test against STDCXX locale 
database. Is that right? So, all locale tests are supposed to be run without 
RWSTD_LOCALE_ROOT defined?


Thanks,
Liviu


Re: STDCXX-1071 numpunct facet defect

2012-09-28 Thread Liviu Nicoara

I thought I replied but I see no trace of my post:

On 09/27/12 20:27, Martin Sebor wrote:

On 09/27/2012 06:41 AM, Liviu Nicoara wrote:

On 09/26/12 20:12, Liviu Nicoara wrote:

I have created STDCXX-1071 and linked to STDCXX-1056. [...]

I am open to all questions, the more the better. Most of my opinions
have been expressed earlier, but please ask if you want to know more.



I am attaching here the proposed (4.3.x) patch and the timings results
(after re-verifying the correctness of the timing program and the
results). The 4.2.x patch, the 4.3.x patch, the test program and the
results file are also attached to the incident.


The patch isn't binary compatible. We can't remove data members
in a minor release. We could only do it in a major release.


I posted two patches, one for 4.3.x and one for 4.2.x (the latter is binary 
compatible). My understanding was that minor version revisions may break binary 
compatibility.

Liviu


STDCXX-1072 SPARC V8 mutex alignment requirements

2012-09-28 Thread Liviu Nicoara

I have created the above and linked it to the closed STDCXX-1066.

In short, my reading about this issue is that the kernel patch changed the 
alignment of the userland mutex objects from a machine word to a double-word 
boundary. No changes are required of the users who use such objects in their 
programs unless users create mutex objects in buffers which may not be aligned 
on a proper boundary. E.g., the following are safe:

mutex_t lock;

struct S {
char misalign;
mutex_t lock;
};

whereas the following is not:

union {
void* align;
char buf [sizeof mutex_t];
} u;

new (u) mutex_t;

because the alignment requirements for void pointer are less strict than for 
mutex_t. A few places in the library use the latter for all sorts of static 
objects (mostly local statics). I looked esp. for places where we build objects 
that contain mutex sub-objects inside a union-aligned buffer:

struct S {
char c;
mutex_t m;
};

...

union {
void* align; // - incorrect
char buf [sizeof (S)];
} u;

new (u) S ();

The alignment must be changed to a value equal or greater than the mutex 
alignment requirements.

IMO, the patch I attached does not break binary compatibility. It uses a one 
size fits all long double for alignment -- like we use in rw/_mutex.h -- but in 
doing so dispenses with all sorts of preprocessor conditionals.

I still don't have access to a SPARC machine. Any feed-back and/or SPARC build 
results are more than welcome!

Thanks!

Liviu

Index: src/messages.cpp
===
--- src/messages.cpp(revision 1390953)
+++ src/messages.cpp(working copy)
@@ -65,7 +65,7 @@ struct __rw_open_cat_data
 {
 nl_catd catd;
 union {
-void *_C_align;
+long double _C_align;
 char _C_data [sizeof (_STD::locale)];
 } loc;
 };
Index: src/locale_body.cpp
===
--- src/locale_body.cpp (revision 1390953)
+++ src/locale_body.cpp (working copy)
@@ -1025,7 +1025,7 @@ _C_manage (__rw_locale *plocale, const c
 // the classic C locale is statically allocated
 // and not destroyed during the lifetime of the process
 static union {
-void* _C_align;
+long double _C_align;
 char  _C_buf [sizeof (__rw_locale)];
 } classic_body;
 
Index: src/use_facet.h
===
--- src/use_facet.h (revision 1390953)
+++ src/use_facet.h (working copy)
@@ -64,7 +64,7 @@
else {  \
/* construct an ordinary facet in static memory */  \
static union {  \
-   void *align_;   \
+   long double align_; \
char  data_ [sizeof (__rw_ ## fid ## _facet)];  \
} f;\
static __rw_facet* const pf =   \
@@ -91,7 +91,7 @@
{   \
/* construct an ordinary facet in static memory */  \
static union {  \
-   void *align_;   \
+   long double align_; \
char  data_ [sizeof (__rw_ ## fid ## _facet)];  \
} f;\
static __rw_facet* const pf =   \
Index: src/ctype.cpp
===
--- src/ctype.cpp   (revision 1390953)
+++ src/ctype.cpp   (working copy)
@@ -627,7 +627,7 @@ _RWSTD_EXPORT __rw_facet* __rw_ct_ctype
 }
 else {
 static union {
-void *align_;
+long double align_;
 char  data_ [sizeof (_STD::ctypechar)];
 } f;
 static __rw_facet* const pf =
Index: src/locale_classic.cpp
===
--- src/locale_classic.cpp  (revision 1390953)
+++ src/locale_classic.cpp  (working copy)
@@ -39,7 +39,7 @@ _RWSTD_NAMESPACE (__rw) {
 
 // static buffer for the classic C locale object
 static union {
-void* _C_align;
+long double _C_align;
 char  _C_buf [sizeof (_STD::locale)];
 } __rw_classic;
 


Re: STDCXX-1072 SPARC V8 mutex alignment requirements

2012-09-28 Thread Liviu Nicoara

On 09/28/12 08:29, Liviu Nicoara wrote:

I have created the above and linked it to the closed STDCXX-1066.
[...]
IMO, the patch I attached does not break binary compatibility.


Scratch this, I haven't thought it through.

Thanks,
Liviu


Re: STDCXX-1072 SPARC V8 mutex alignment requirements

2012-09-28 Thread Liviu Nicoara

On 09/28/12 08:45, Liviu Nicoara wrote:

On 09/28/12 08:29, Liviu Nicoara wrote:

I have created the above and linked it to the closed STDCXX-1066.
[...]
IMO, the patch I attached does not break binary compatibility.


Scratch this, I haven't thought it through.


Actually, after more thought, I believe that the patch itself is not causing 
binary incompatibility. The library built on a kernel-patched system will be 
binary incompatible with a library built on a previous system, regardless of 
the patch I presented earlier, because new mutex alignment requirements will 
have changed the size and layout of library objects. Comments are welcome.

Thanks,
Liviu



Re: STDCXX-1071 numpunct facet defect

2012-09-28 Thread Liviu Nicoara

On 09/28/12 11:01, Travis Vitek wrote:

Only major versions can break binary. The versioning policy for stdcxx can be 
found here..

   http://stdcxx.apache.org/versions.html


Thanks, that clarifies things.

Liviu


Re: STDCXX-1072 SPARC V8 mutex alignment requirements

2012-09-28 Thread Liviu Nicoara

On 09/28/12 11:45, Travis Vitek wrote:

Liviu,

Sorry I didn't look until just now, but it appears that I could have re-opened 
STDCXX-1066. I only see the 'Reopen Issue' button for STDCXX issues, but it is 
most definitely there. Perhaps there is some sort of permission issue for you?


It's ok, I think it's somewhat cleaner to create a new one and link it to the 
old one. Even if clean was not a concern, it was within Stefan's options to 
close the incident. I don't know.



Also, STDCXX-1066 appears to have been a duplicate of STDCXX-1040.


Yep, forgot about it, I am thinking about linking that one, too.

Thanks.


Re: STDCXX-1072 SPARC V8 mutex alignment requirements

2012-09-28 Thread Liviu Nicoara

On 09/28/12 11:32, Travis Vitek wrote:



-Original Message-
From: Liviu Nicoara
Sent: Friday, September 28, 2012 5:29 AM





In short, my reading about this issue is that the kernel patch changed
the alignment of the userland mutex objects from a machine word to a
double-word boundary. No changes are required of the users who use such
objects in their programs unless users create mutex objects in buffers
which may not be aligned on a proper boundary.


Your reading of this is consistent with my previous understanding of the 
problem, so that is good.



I still don't have access to a SPARC machine. Any feed-back and/or
SPARC build results are more than welcome!



I can provide build results for SPARCV9 if we want them, but I thought that the 
problem only came up on 32-bit SPARCV8 builds.


I guess a -xarch=sparcv8 build will do. It is quite unlikely anybody has a real 
32-bit SPARC hardware.



The patch assumes the type `long double' exists on every platform. While I do 
believe that it is available everywhere, we have lots of conditional code 
guarding its use in the library now. If we are going to use `long double' in 
this context, we should guard it with _RWSTD_NO_LONG_DOUBLE. I think an even 
cleaner solution is to switch to using __rw_aligned_buffer instead. It gives us 
a single point of failure for alignment issues like this, and it makes the code 
self-documenting and easier to read.


I took a cue from an alignment in the rw/_mutex.h code there. I'll do better.



As for your concerns about binary compatibility, I think that everything should 
be okay. We aren't changing the size of anything that is being passed around, 
we're just changing its alignment. I could be wrong, but I'm pretty sure that 
we're safe.



The library object sizes and layouts will change with or without our patch. 
Before the kernel patch our objects will have one layout and size and different 
ones afterwards. E.g.:

struct locale {
int c;
pthread_mutex_t lock;
};

before kernel patching you'd have no padding between the members. That's what I 
thought when firing that second post about compatibility.

Liviu


Re: STDCXX-1072 SPARC V8 mutex alignment requirements

2012-09-28 Thread Liviu Nicoara

On 09/28/12 13:51, Martin Sebor wrote:

[...]
One other comment: I would suggest choosing subjects for bug
reports that reflect the problem rather than a fix for it or
a rationale for it. For  STDCXX-1066 I think something like
Library mutex objects misaligned on SPARCV8 would better
capture the problem than the current title. (It's also up
to us to rename an issue if we find it more descriptive
than the original.)


I can't do that myself. I looked at that and 1056 and there is no button for me 
to reopen, or to edit stuff which is not mine.

Needless to say, I like my issues better. But I have no problems with the 
re-opening of the closed ones.

Thanks,
Liviu



Re: STDCXX-1071 numpunct facet defect

2012-09-28 Thread Liviu Nicoara

On 9/28/12 11:01 AM, Travis Vitek wrote:

Only major versions can break binary. The versioning policy for stdcxx can be 
found here..

   http://stdcxx.apache.org/versions.html


I have renamed the binary-incompatible patch as patch-5.0.x.diff.

Thanks,
Liviu



Travis

-Original Message-
From: Liviu Nicoara [mailto:nikko...@hates.ms]
Sent: Friday, September 28, 2012 3:52 AM
To: dev@stdcxx.apache.org
Subject: Re: STDCXX-1071 numpunct facet defect

I thought I replied but I see no trace of my post:

On 09/27/12 20:27, Martin Sebor wrote:

On 09/27/2012 06:41 AM, Liviu Nicoara wrote:

On 09/26/12 20:12, Liviu Nicoara wrote:

I have created STDCXX-1071 and linked to STDCXX-1056. [...]

I am open to all questions, the more the better. Most of my opinions
have been expressed earlier, but please ask if you want to know more.



I am attaching here the proposed (4.3.x) patch and the timings results
(after re-verifying the correctness of the timing program and the
results). The 4.2.x patch, the 4.3.x patch, the test program and the
results file are also attached to the incident.


The patch isn't binary compatible. We can't remove data members
in a minor release. We could only do it in a major release.


I posted two patches, one for 4.3.x and one for 4.2.x (the latter is binary 
compatible). My understanding was that minor version revisions may break binary 
compatibility.

Liviu





Re: [jira] [Closed] (STDCXX-1066) SPARCV8 requires pthread_mutex_t and pthread_cond_t to be aligned on an 8-byte boundary

2012-09-27 Thread Liviu Nicoara

On 09/27/12 07:15, Pavel Heimlich, a.k.a. hajma wrote:

2012/9/26 Liviu Nicoara nikko...@hates.ms:

On 09/26/12 05:49, Pavel Heimlich, a.k.a. hajma wrote:


2012/9/26 Liviu Nicoara nikko...@hates.ms:


On 9/25/12 7:56 PM, Stefan Teleman (JIRA) wrote:




[

https://issues.apache.org/jira/browse/STDCXX-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]




Anybody around here, except Stefan, who has access to a SPARC V8 machine
updated to the specified kernel update or later, and who is willing to
run a
simple test program? It's a 5 minute job at most.



Please point me to the test program.



Hi Pavel,

I attached it. IIRC Solaris had both Solaris threads API and a POSIX threads
API on top of it. Could you please give it a run with MUTEX defined to both
mutex_t and pthread_mutex_t? Might need to tweak the includes.


Here it is, the define seems to make no difference.
BTW I'm not sure about the bug description, was Solaris 10 ever
supported on a sparcv8? I ran it on the crappiest machine I could
find, but all that is sparcv9.


Sun Blade 1000 - UltraSPARC-III+(sparcv9), running Solaris 10u9
bash-3.00# /opt/SUNWspro/bin/CC a.cpp
bash-3.00# ./a.out
24
ffbffcd8 ffbffcf0 ffbffd08 ffbffd20 ffbffd38 ffbffd50 ffbffd68 ffbffd80
ffbffbe0 ffbffc00 ffbffc20 ffbffc40 ffbffc60 ffbffc80 ffbffca0 ffbffcc0
bash-3.00# /opt/SUNWspro/bin/CC -xarch=v8 ./a.cpp
bash-3.00# ./a.out
24
ffbffcd8 ffbffcf0 ffbffd08 ffbffd20 ffbffd38 ffbffd50 ffbffd68 ffbffd80
ffbffbe0 ffbffc00 ffbffc20 ffbffc40 ffbffc60 ffbffc80 ffbffca0 ffbffcc0
bash-3.00# vi a.cpp (--- pthread_mutex_t)
bash-3.00# /opt/SUNWspro/bin/CC a.cpp
bash-3.00# ./a.out
24
ffbffcd8 ffbffcf0 ffbffd08 ffbffd20 ffbffd38 ffbffd50 ffbffd68 ffbffd80
ffbffbe0 ffbffc00 ffbffc20 ffbffc40 ffbffc60 ffbffc80 ffbffca0 ffbffcc0
bash-3.00# /opt/SUNWspro/bin/CC -xarch=v8 ./a.cpp
bash-3.00# ./a.out
24
ffbffcd8 ffbffcf0 ffbffd08 ffbffd20 ffbffd38 ffbffd50 ffbffd68 ffbffd80
ffbffbe0 ffbffc00 ffbffc20 ffbffc40 ffbffc60 ffbffc80 ffbffca0 ffbffcc0

HTH


Very much appreciated.

Liviu


Re: [PATCH] STDCXX-1069 [was: Re: SUNPro 5.12 compilation failure in optimized Linux builds]

2012-09-27 Thread Liviu Nicoara

On 09/23/12 12:15, Liviu Nicoara wrote:

On 09/22/12 00:51, Liviu Nicoara wrote:

Optimized (but not debug) builds fail to compile setlocale.cpp with the error:


A patch and a comment have been attached to the issue.


I am posting it here to save a trip to the JIRA issue. Any feed-back is 
appreciated.

The issue can be tracked down to the interaction between the definition of 
_XOPEN_SOURCE in src/setlocale.cpp and util/path.cpp, the system headers, and Solaris 
Studio C . It seems that the original purpose for the definitions has been lost because 
both S_IFDIR and symlink are defined and declared, respectively, on modern Linux (edit: 
by default). Moreover, it seems that the original definition of _XOPEN_SOURCE was 
incorrect, without a value. I propose we remove the two blocks – see the patch.

Thanks,
Liviu

Index: src/setlocale.cpp
===
--- src/setlocale.cpp   (revision 1388733)
+++ src/setlocale.cpp   (working copy)
@@ -33,11 +33,6 @@
 
 #include rw/_defs.h
 
-#if defined (__linux__)  !defined (_XOPEN_SOURCE)

-   // need S_IFDIR on Linux
-#  define _XOPEN_SOURCE
-#endif   // __linux__  !_XOPEN_SOURCE
-
 #include locale.h   // for setlocale()
 #include stdlib.h   // for getenv()
 #include string.h   // for memcpy(), strcmp()
Index: util/path.cpp
===
--- util/path.cpp   (revision 1388733)
+++ util/path.cpp   (working copy)
@@ -27,16 +27,6 @@
  **/
 
 #ifndef _WIN32

-#  ifdef __linux__
- // for symlink()
-#ifndef _XOPEN_SOURCE
-#  define _XOPEN_SOURCE
-#endif
-#ifndef _XOPEN_SOURCE_EXTENDED
-#  define _XOPEN_SOURCE_EXTENDED
-#endif
-#  endif   // __linux__
-
 #  include unistd.h  // for getcwd()
 #  include sys/stat.h// for struct stat, stat()
 #else



Re: STDCXX-1068 and alignment

2012-09-27 Thread Liviu Nicoara

On 09/22/12 16:22, Stefan Teleman wrote:

On Sat, Sep 22, 2012 at 4:07 PM, Liviu Nicoara nikko...@hates.ms wrote:

Stefan, could you please elaborate on the misaligned reads/writes that you
observed on those platforms? What was failing?


Several tests from the test harness were failing because of it, and
for no other reason.

I'll have to go through my stuff at work and look at the test reports
from the Compiler Guys, we were testing these back in March 2011, and
right now I can't remember exactly which tests were failing. I'll have
the details in a few days.



Any news on this one?

Thanks,
Liviu


Re: STDCXX-1071 numpunct facet defect

2012-09-27 Thread Liviu Nicoara

On 09/26/12 20:12, Liviu Nicoara wrote:

I have created STDCXX-1071 and linked to STDCXX-1056. [...]

I am open to all questions, the more the better. Most of my opinions have been 
expressed earlier, but please ask if you want to know more.



I am attaching here the proposed (4.3.x) patch and the timings results (after 
re-verifying the correctness of the timing program and the results). The 4.2.x 
patch, the 4.3.x patch, the test program and the results file are also attached 
to the incident.

Thanks,
Liviu



Index: include/loc/_numpunct.h
===
--- include/loc/_numpunct.h (revision 1388733)
+++ include/loc/_numpunct.h (working copy)
@@ -61,7 +61,7 @@ struct numpunct: _RW::__rw_facet
 string_type;
 
 _EXPLICIT numpunct (_RWSTD_SIZE_T __ref = 0)
-: _RW::__rw_facet (__ref), _C_flags (0) { }
+: _RW::__rw_facet (__ref) { }
 
 virtual ~numpunct () _RWSTD_ATTRIBUTE_NOTHROW;
 
@@ -109,15 +109,6 @@ protected:
 virtual string_type do_falsename () const {
 return _RW::__rw_get_punct (this, _RW::__rw_fn, char_type ());
 }
-
-private:
-
-int _C_flags;   // bitmap of cached data valid flags
-string  _C_grouping;// cached results of virtual members
-string_type _C_truename;
-string_type _C_falsename;
-char_type   _C_decimal_point;
-char_type   _C_thousands_sep;
 };
 
 
@@ -139,17 +130,7 @@ template class _CharT
 inline _TYPENAME numpunct_CharT::char_type
 numpunct_CharT::decimal_point () const
 {
-if (!(_C_flags  _RW::__rw_dp)) {
-
-numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-// [try to] get the decimal point first (may throw)
-// then set a flag to avoid future initializations
-__self-_C_decimal_point  = do_decimal_point ();
-__self-_C_flags |= _RW::__rw_dp;
-}
-
-return _C_decimal_point;
+return do_decimal_point ();
 }
 
 
@@ -157,34 +138,14 @@ template class _CharT
 inline _TYPENAME numpunct_CharT::char_type
 numpunct_CharT::thousands_sep () const
 {
-if (!(_C_flags  _RW::__rw_ts)) {
-
-numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-// [try to] get the thousands_sep first (may throw)
-// then set a flag to avoid future initializations
-__self-_C_thousands_sep  = do_thousands_sep ();
-__self-_C_flags |= _RW::__rw_ts;
-}
-
-return _C_thousands_sep;
+return do_thousands_sep ();
 }
 
 
 template class _CharT
 inline string numpunct_CharT::grouping () const
 {
-if (!(_C_flags  _RW::__rw_gr)) {
-
-numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-// [try to] get the grouping first (may throw)
-// then set a flag to avoid future initializations
-__self-_C_grouping  = do_grouping ();
-__self-_C_flags|= _RW::__rw_gr;
-}
-
-return _C_grouping;
+return do_grouping ();
 }
 
 
@@ -192,17 +153,7 @@ template class _CharT
 inline _TYPENAME numpunct_CharT::string_type
 numpunct_CharT::truename () const
 {
-if (!(_C_flags  _RW::__rw_tn)) {
-
-numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-// [try to] get the true name first (may throw)
-// then set a flag to avoid future initializations
-__self-_C_truename  = do_truename ();
-__self-_C_flags|= _RW::__rw_tn;
-}
-
-return _C_truename;
+return do_truename ();
 }
 
 
@@ -210,17 +161,7 @@ template class _CharT
 inline _TYPENAME numpunct_CharT::string_type
 numpunct_CharT::falsename () const
 {
-if (!(_C_flags  _RW::__rw_fn)) {
-
-numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-// [try to] get the false name first (may throw)
-// then set a flag to avoid future initializations
-__self-_C_falsename  = do_falsename ();
-__self-_C_flags |= _RW::__rw_fn;
-}
-
-return _C_falsename;
+return do_falsename ();
 }
 
 // #endif _RWSTD_NO_EXT_NUMPUNCT_PRIMARY
-*- mode: org -*-

* Machines:

** iMac, Intel, 4 cores:

$ uname -a; gcc -v
Darwin imax 11.4.0 Darwin Kernel Version 11.4.0: Mon Apr  9 19:32:15 PDT 2012; 
root:xnu-1699.26.8~1/RELEASE_X86_64 x86_64
gcc version 4.7.1 (GCC) 

** Linux Slackware, AMD, 16 cores:

$ uname -a; gcc -v
Linux behemoth 2.6.37.6 #3 SMP Sat Apr 9 22:49:32 CDT 2011 x86_64 AMD 
Opteron(tm) Processor 6134 AuthenticAMD GNU/Linux
gcc version 4.5.2 (GCC) 

* Method

** Library

Apply the patch. Build an optimized library (I used 12S in all runs). Build the 
library, rwtest, and locale database:

$ nice make -Clib
$ nice make -Cbin locales
$ nice make -Crwtest 

Properly export the necessary envar if running against STDCXX locale
database or unset, otherwise:

$ export RWSTD_LOCALE_ROOT=/path/to/.../nls

** Test program

Place the multi-threaded program source file, t.cpp, in
srcdir/tests/localization and run make

Re: svn commit: r1389337 - /stdcxx/branches/4.2.x/tests/support/atomic_xchg.cpp

2012-09-26 Thread Liviu Nicoara

On 09/25/12 23:20, Martin Sebor wrote:

On 09/24/2012 06:38 AM, lnico...@apache.org wrote:

Author: lnicoara
Date: Mon Sep 24 12:38:17 2012
New Revision: 1389337

URL: http://svn.apache.org/viewvc?rev=1389337view=rev
Log:
2012-09-24  Liviu Nicoara  nikko...@hates.ms

 * tests/support/atomic_xchg.cpp: (run_test) moved counter volatile
   qualification to match STDCXX conventions (see r1388733)


The stdcxx ChangeLog format follows the GNU Coding Standard:
   http://www.gnu.org/prep/standards/html_node/Change-Logs.html

I.e., this entry should look like this:

 * tests/support/atomic_xchg.cpp (run_test): Move counter
   volatile qualification to match STDCXX convention (see
   r1388733).

(The colon follows the parenthesized function name. The text
should also end with a period.)

The svn propedit command lets you change a comment for an
already committed change.


Sorry about that. I fixed it.

L


Reopen closed incidents

2012-09-26 Thread Liviu Nicoara

I want to reopen the closed incidents, esp. 1056. On a second thought it might 
be more useful to open new ones and link the old ones so that we don't mess 
with ownership, etc. If nobody objects to this I will go forward with the 
latter.

Thanks.


STDCXX-1071 numpunct facet defect

2012-09-26 Thread Liviu Nicoara
I have created STDCXX-1071 and linked to STDCXX-1056. I have reduced the scope 
to numpunct because moneypunct is not failing for me. If someone has a 
moneypunct failure listing I want to see it.


I have reduced the library code to a failing test case. I have attached there 
the reduced program. The program shows the same failures like the library 
(interestingly enough it does not fail when removing the data caching).


I have (re)attached the forwarding patches I created for 1056, one binary 
compatible, the other not.


Martin, my wish is to move this to some completion. Please let me know if there 
is something else you think I should do to speed this up.


I am open to all questions, the more the better. Most of my opinions have been 
expressed earlier, but please ask if you want to know more.


Thanks,
Liviu


Re: STDCXX-1056 : numpunct fix

2012-09-25 Thread Liviu Nicoara

On 09/24/12 22:28, Martin Sebor wrote:

On 09/20/2012 06:46 PM, Stefan Teleman wrote:

On Thu, Sep 20, 2012 at 8:39 PM, Liviu Nicoaranikko...@hates.ms  wrote:


I have not created this requirement out of thin air. STDCXX development has 
functioned in this manner for as long as I remember. If it does not suit you, 
that's fine.


That would explain why these bugs are present in the first place.

If the official method of determining thread-safety here is
fprintf(3C), then we have a much bigger problem than
22.locale.numpunct.mt.


Tests that end in .mt.cpp use the RW_ASSERT() macro to verify
correctness in multithreaded programs. There is no way at the
moment to run a thread analyzer as part of the test suite,
although it would be a nice addition. As I said in my other
reply, though, not all data races necessarily indicate bugs,
so we'd need a way to distinguish between know/benign races
and the rest. (Unless we decide to eliminate all races and
accept the performance penalty.)


Preserving the lazy initialization is possible: we perfectly forward to the 
implementation with either:

1. no additional synchronization (preserves fast reads of the data), while 
fully knowing that there are data races in __rw_get_numpunct/_C_get_data there, 
or
2.we put a big lock on top of all and every facet operations (the version that 
would be race free).

Caching seems out of the window for now, more so in the presence of unguarded 
reads of facet's data.

The only argument for (1) is that at the moment we don't have a failing test 
yet, and not for lack of arduous trying.

However, if atomic/membar APIs would be recognized by the compiler, lazy 
initialization of the facet and caching of data would be attainable in a quite 
simple fashion.

Liviu


Re: STDCXX-1056 : numpunct fix

2012-09-25 Thread Liviu Nicoara

On 09/24/12 23:50, Stefan Teleman wrote:

On Mon, Sep 24, 2012 at 10:03 PM, Martin Sebor mse...@gmail.com wrote:

FWIW, there are race conditions in stdcxx. Some of them are by
design and benign on the systems the library runs on (although
I acknowledge that some others may be bugs). One such benign
date race is:

   timeT1  T2
 0x = N
 1x = N   read x

where x is a scalar that can be accessed atomically by the CPU
and the compiler.

I think some of the lazy facet initialization falls under this
class. It would be nice to fix them but only if it doesn't slow
things down. The others need to be fixed in any case.


The race conditions I am talking about are not benign.


Caching failures aside, we have no failing tests. Before saying they are malign 
you need to objectively show a failing program.



It shows the types of race conditions it's reporting: these are
read/write race conditions, not read/read. The thread analyzer's html
filter doesn't show the types of races reported as clearly as the
command-line analyzer which has a windowing GUI.


Martin's example above is a read-write race. In the absence of a failing test 
case it is quite unreasonable to modify the current implementation.

Liviu



STDCXX-1070

2012-09-25 Thread Liviu Nicoara

I filed 1070, failure to build 22.locale.collate.cpp on Linux with gcc 4.7.1. 
Gcc, Comeau and Clang fail to compile it, Intel and Sun are fine. It looks to 
me like Intel and Sun compilers are not doing the right thing. A small test 
case and a patch have been attached. The failing code has been reduced to:

$ cat test.cpp; g++ -c test.cpp
template  class charT 
void f ()
{
g (charT ('a'));
}

template  class charT 
void g (charT)
{
}

int h ()
{
return f char  (), 0;
}
test.cpp: In instantiation of 'void f() \[with charT = char\]':
test.cpp:14:23:   required from here
test.cpp:4:5: error: 'g' was not declared in this scope, and no declarations 
were found by argument-dependent lookup at the point of instantiation 
\[-fpermissive\]
test.cpp:8:6: note: 'templateclass charT void g(charT)' declared here, later 
in the translation unit

Liviu


Re: [jira] [Closed] (STDCXX-1056) std::moneypunct and std::numpunct implementations are not thread-safe

2012-09-25 Thread Liviu Nicoara

On 9/25/12 7:56 PM, Stefan Teleman (JIRA) wrote:


  [ 
https://issues.apache.org/jira/browse/STDCXX-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Stefan,

I don't think it's ok to close this bug. The race conditions are there and we 
have not come to a completely satisfactory conclusion on how to deal with them, 
or even if we should deal with them. Whichever it is we gotta keep this 
discussion open. I sure hope you want to be a part of it.


FWIW, I have spent quite some time looking at your proposed patch and I am going 
to reopen the incident. If I can.


Liviu



Stefan Teleman closed STDCXX-1056.
--

 Resolution: Won't Fix

Bug will not be fixed. Upstream refuses to acknowledge the existence of the bug 
in spite of objective evidence to the contrary.


std::moneypunct and std::numpunct implementations are not thread-safe
-

 Key: STDCXX-1056
 URL: https://issues.apache.org/jira/browse/STDCXX-1056
 Project: C++ Standard Library
  Issue Type: Bug
  Components: 22. Localization
Affects Versions: 4.2.1, 4.2.x, 4.3.x, 5.0.0
 Environment: Solaris 10 and 11, RedHat and OpenSuSE Linux, Sun C++ 
Compilers 12.1, 12.2, 12.3
Issue is independent of platform and/or compiler.
Reporter: Stefan Teleman
  Labels: thread-safety
 Fix For: 4.2.x, 4.3.x, 5.0.0

 Attachments: 22.locale.numpunct.mt.out, facet.cpp.diff, 
locale_body.cpp.diff, punct.cpp.diff, runtests-linux32-all.out, 
runtests-linux64-all.out, runtests.out, STDCXX-1056-additional-timings.tgz, 
stdcxx-1056.patch, stdcxx-1056-timings.tgz, 
stdcxx-4.2.x-numpunct-perfect-forwarding.patch, 
stdcxx-4.3.x-numpunct-perfect-forwarding.patch


several member functions in std::moneypunct and std::numpunct return
a std::string by value (as required by the Standard). The implication of 
return-by-value
being that the caller owns the returned object.
In the stdcxx implementation, the std::basic_string copy constructor uses a 
shared
underlying buffer implementation. This shared buffer creates the first problem 
for
these classes: although the std::string object returned by value *appears* to 
be owned
by the caller, it is, in fact, not.
In a mult-threaded environment, this underlying shared buffer can be 
subsequently modified by a different thread than the one who made the initial 
call. Furthermore, two or more different threads can access the same shared 
buffer at the same time, and modify it, resulting in undefined run-time 
behavior.
The cure for this defect has two parts:
1. the member functions in question must truly return a copy by avoiding a call 
to the copy constructor, and using a constructor which creates a deep copy of 
the std::string.
2. access to these member functions must be serialized, in order to guarantee 
atomicity
of the creation of the std::string being returned by value.
Patch for 4.2.1 to follow.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira





Re: [jira] [Closed] (STDCXX-1066) SPARCV8 requires pthread_mutex_t and pthread_cond_t to be aligned on an 8-byte boundary

2012-09-25 Thread Liviu Nicoara

On 9/25/12 7:56 PM, Stefan Teleman (JIRA) wrote:


  [ 
https://issues.apache.org/jira/browse/STDCXX-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Anybody around here, except Stefan, who has access to a SPARC V8 machine updated 
to the specified kernel update or later, and who is willing to run a simple test 
program? It's a 5 minute job at most.


Thanks!



Stefan Teleman closed STDCXX-1066.
--

 Resolution: Won't Fix

Bug will not be fixed upstream. It is fixed in the Solaris releases.


SPARCV8 requires pthread_mutex_t and pthread_cond_t to be aligned on an 8-byte 
boundary
---

 Key: STDCXX-1066
 URL: https://issues.apache.org/jira/browse/STDCXX-1066
 Project: C++ Standard Library
  Issue Type: Bug
  Components: Thread Safety
Affects Versions: 4.2.1, 4.2.2, 4.2.x, 4.3.x
 Environment: Solaris 10 Update 6 or later on SPARCV8 [32-bit]
Defect is compiler-independent - in reality it only affects Sun Studio and GCC.
Reporter: Stefan Teleman
  Labels: features, runtime, threads
 Fix For: 4.2.2, 4.2.x, 4.3.x

 Attachments: _config-gcc.h.stdcxx-1066.patch, 
_config-sunpro.h.stdcxx-1066.patch, ctype.cpp.stdcxx-1066.patch, 
exception.cpp.stdcxx-1066.patch, ios.cpp.stdcxx-1066.patch, 
iostream.cpp.stdcxx-1066.patch, iostream.stdcxx-1066.patch, 
locale_body.cpp.stdcxx-1066.patch, locale_classic.cpp.stdcxx-1066.patch, 
messages.cpp.stdcxx-1066.patch, _mutex.h.stdcxx-1066.patch, 
time_put.cpp.stdcxx-1066.patch, use_facet.h.stdcxx-1066.patch


Starting with Solaris 10 Update 6, on SPARCV8, pthread_mutex_t and 
pthread_cond_t MUST be aligned on an 8-byte boundary. Misaligned access will 
result in either SEGV or SIGBUS.
There are numerous places in the multi-threaded version of stdcxx where 
pthread_mutex_t and/or pthread_cond_t types are contained within an union, but 
with an enforced alignment different than 8. All these instances must be 
corrected, and #ifdef-guarded for SPARCV8.
Patches to follow shortly, this is just opening the issue.
Warning: the patchset resolving this issue is very large, and it affects a 
large number of files.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira





Re: STDCXX-1056 : numpunct fix

2012-09-25 Thread Liviu Nicoara

On 9/24/12 11:06 PM, Stefan Teleman wrote:

On Mon, Sep 24, 2012 at 7:48 PM, Liviu Nicoara nikko...@hates.ms wrote:


Stefan, was it your intention to completely eliminate all the race
conditions with this last patch? Is this what the tools showed in your
environment?



https://issues.apache.org/jira/browse/STDCXX-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13461452#comment-13461452


Yes, all the race conditions in std::numpunct and std::moneypunct.
Not all the race conditions in stdcxx.


FWIW, I have analyzed the performance of the latest patch from Stefan. The test 
program is the same test program used in the timings attachments to the 
incident. The version of the library was 4.2.1, compiler gcc 4.7.1, with:


1. The simple, perfect forwarding patch, with no caching, which does eliminate 
race conditions.

2. Stefan's latest patch
3. A patch I created which truly, if crudely, eliminates all facet race 
conditions by using a top-level lock for all facet operations(*)


The numbers from the first test run are already available in the incident.

The numbers for the second test run were a bit bigger, which is expected because 
of the additional synchronization. E.g., I have found that the test, run against 
the STDCXX locale database, gives:


$ time ./t en_US.UTF-8 4 500
4, 500

real0m19.210s
user0m32.659s
sys 0m36.507s

where the first patch gives:

$ time ./t en_US.UTF-8 4 5000; done
4, 5000

real0m7.961s
user0m31.211s
sys 0m0.003s

Notice the number of loops, 10x larger in the second set of number.

Now, for the third patch, the numbers are embarrassingly large. It practically 
slows the program to a crawl because it does not do any reads of facet data 
without holding the lock. This, I wager, eliminates all races in the 
numpunct/moneypunct facet. See:


$ time ./t en_US.UTF-8 4 10
4, 10

real0m24.316s
user0m28.213s
sys 0m6.633s

Now that is a whooping performance hit.

I went back to see an explanation between the numbers I see with Stefan's more 
performing patch, and my lock-all patch. I believe that there are still places 
in Stefan's patch where the facet makes unguarded reads of facet data, e.g.:


loc/_numpunct.h:
190 // returns a pointer to the facet's implementation data
191 // if it exists, 0 otherwise
192 const void* _C_data () const {
193 return _C_impsize ? _C_impdata
194 : _RWSTD_CONST_CAST (__rw_facet*, this)-_C_get_data ();
195 }

where _C_impsize/_C_impdata are read outside of a lock scope, etc. I.e., a 
thread may read the _C_impsize while, say, another is still writing the 
_C_impdata member variable.


I conclude, even with such simple testing, that any solution that attempts to 
completely eliminate the race conditions in these facets will incur a penalty, 
with the safest (top-level locking) the worst by orders of magnitude and any 
other solution somewhere in between.


Please let me know if you have any questions.

Liviu

(*) Will attach it as soon as we figure out how to undo the closing of the 
issue.


Re: [PATCH] STDCXX-853

2012-09-24 Thread Liviu Nicoara

On 09/24/12 01:18, Travis Vitek wrote:


Liviu,

Should the volatile be to the left of the intT typename here? I know it is 
equivalent, but it is weird to look at the line of code below and see that 
we're following two different conventions.



Thanks, will do.


Travis
___
From: Liviu Nicoara
Sent: Sunday, September 23, 2012 8:34 AM
To: dev@stdcxx.apache.org
Subject: [PATCH] STDCXX-853

Umm, I didn't think to search for a corresponding incident and I considered the 
defect to be so minor as to not warrant an issue. The following patch has been 
applied already on 4.2.x:

Index: tests/support/atomic_xchg.cpp
===
--- tests/support/atomic_xchg.cpp   (revision 1388732)
+++ tests/support/atomic_xchg.cpp   (revision 1388733)
@@ -297,7 +297,7 @@ void run_test (intT, thr_args_base::tag_
   // compute the expected result, skipping zeros by incrementing
   // expect twice when it overflows and wraps around to 0 (zero is
   // used as the lock variable in thread_routine() above)
-intT expect = intT (1);
+intT volatile expect = intT (1);

   const unsigned long nincr = (Args::nthreads_ * Args::nincr_) / 2U;




--
And now I see with eye serene
The very pulse of the machine.


Re: STDCXX-1056 : numpunct fix

2012-09-24 Thread Liviu Nicoara

On 9/24/12 12:06 AM, Stefan Teleman wrote:

On Fri, Sep 21, 2012 at 9:10 AM, Liviu Nicoara nikko...@hates.ms wrote:

On 09/21/12 05:13, Stefan Teleman wrote:


On Fri, Sep 21, 2012 at 2:28 AM, Travis Vitek
travis.vi...@roguewave.com wrote:

I have provided this list with test results showing that my patch
*does* fix the race condition problems identified by all the tools at
my disposal. I'm willing to bet you never looked at it. You dismissed
it outright, just because you didn't like the *idea* that increasing
the size of the cache, and eliminating that useless cache resizing
would play an important role in eliminating the race conditions.




I looked at it in great detail and I am sure Travis read it too. The
facet/locale buffer management is a red herring. Apparently, we cannot
convince you, regardless of the argument. That's fine by me.


This bug [STDCXX-1056] was updated over the weekend with new comments.
Here's the link to the comments, for the record:


Stefan, was it your intention to completely eliminate all the race conditions 
with this last patch? Is this what the tools showed in your environment?


Thanks,
Liviu



https://issues.apache.org/jira/browse/STDCXX-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13461452#comment-13461452





Re: STDCXX-1066 [was: Re: STDCXX forks]

2012-09-23 Thread Liviu Nicoara

On 09/16/12 12:03, Stefan Teleman wrote:

On Sat, Sep 15, 2012 at 5:47 PM, Liviu Nicoara nikko...@hates.ms wrote:


I merely wanted to point out that restoring the default packing is not the
same as restoring the packing to the previous value in effect.

Given this, I thought about an alternative way of forcing this alignment,
e.g., via a union, aligned on an appropriate type. I see an advantage here
in that most of the changes would occur where we define the 'primitive'
mutex and condition wrappers, saving a few pre-processor conditionals and
pragmas along the way.


I understood what you were saying. I just don't understand under what
_sane_ circumstances a program would #include a system library header
file with a previously set packing to something other than default. Or
would expect a non-default packing to work during a function call to a
sytem library. That's an insane configuration on any operating system
that I know of, not just on SPARCV8.


I can't think of a valid scenario, either. I guess the point is moot.

A few more questions, if you will, as I am going through the changes:

1. I see similarities with 1040, should/would you close that one?
2. The issue only exists in MT builds, should there be a guard in configs?
3. The align reference docs talk only about aligning variables, not types. Is 
that different on SPARC?
4. I see rw/_mutex.h has alignment pragmas for both __rw_mutex_base class and 
its mutex member; same for __rw_static_mutex and its static member, etc. How 
does that work?
5. Why is __rw_guard aligned explicitly? I see it only contains a pointer to a 
mutex object.
6. The docs mention that the pragma must use the mangled variables names but I 
don't see that in the patch.

Thanks!

Liviu





[PATCH] STDCXX-853

2012-09-23 Thread Liviu Nicoara

Umm, I didn't think to search for a corresponding incident and I considered the 
defect to be so minor as to not warrant an issue. The following patch has been 
applied already on 4.2.x:

Index: tests/support/atomic_xchg.cpp
===
--- tests/support/atomic_xchg.cpp   (revision 1388732)
+++ tests/support/atomic_xchg.cpp   (revision 1388733)
@@ -297,7 +297,7 @@ void run_test (intT, thr_args_base::tag_
 // compute the expected result, skipping zeros by incrementing
 // expect twice when it overflows and wraps around to 0 (zero is
 // used as the lock variable in thread_routine() above)
-intT expect = intT (1);
+intT volatile expect = intT (1);
 
 const unsigned long nincr = (Args::nthreads_ * Args::nincr_) / 2U;
 


[PATCH] STDCXX-1069 [was: Re: SUNPro 5.12 compilation failure in optimized Linux builds]

2012-09-23 Thread Liviu Nicoara

On 09/22/12 00:51, Liviu Nicoara wrote:

Optimized (but not debug) builds fail to compile setlocale.cpp with the error:


A patch and a comment have been attached to the issue.

Thanks,
Liviu


Re: STDCXX-1066 [was: Re: STDCXX forks]

2012-09-23 Thread Liviu Nicoara

On 9/23/12 3:48 PM, Stefan Teleman wrote:

On Sun, Sep 23, 2012 at 3:26 PM, Liviu Nicoara nikko...@hates.ms wrote:


To be honest it's quite bizarre that you cannot share that with us. Is it
really a trade secret? How can that be the case if Oracle customers are also
required to perform the same alignment, perhaps using the same techniques
like you showed in the patch?


That's the problem. I don't know what is and what is not a trade
secret, or copyrighted information, or unauthorized intellectual
property disclosure anymore.  I think it's in the eye of the
beholder. At Sun it was very clear.

Believe it or not, I had to get written approval from the Legal
Counsel's Office in order to be able to share these patches. And that
in spite of the fact that these patches are published, and have
already been published for *years*.


That clarifies things. Thanks.

Liviu



Re: STDCXX-1066 [was: Re: STDCXX forks]

2012-09-23 Thread Liviu Nicoara

On 9/23/12 5:50 PM, Stefan Teleman wrote:

On Sun, Sep 23, 2012 at 5:23 PM, Stefan Teleman
stefan.tele...@gmail.com wrote:


The second URL says this:

QUOTE
Due to a change in the implementation of the userland mutexes
introduced by CR 6296770 in KU 137111-01, objects of type mutex_t and
pthread_mutex_t must start at 8-byte aligned addresses. If this
requirement is not satisfied, all non-compliant applications on
Solaris/SPARC may fail with the signal SEGV with a callstack similar
to the following one or with similar callstacks containing the
function mutex_trylock_process.

   \*_atomic_cas_64(0x141f2c, 0x0, 0xff00, 0x1651, 0xff00, 0x466d90)
   set_lock_byte64(0x0, 0x1651, 0xff00, 0x0, 0xfec82a00, 0x0)
   fast_process_lock(0x141f24, 0x0, 0x1, 0x1, 0x0, 0xfeae5780)

/QUOTE


Here's a link to an official datatype alignment table for SPARCV8:

http://docs.oracle.com/cd/E19205-01/819-5267/bkbkl/index.html

The interesting table is:

Table B–2 Storage Sizes and Default Alignments in Bytes


I see nothing really outstanding here. What is it that I should pay attention 
to?

Thanks,
Liviu


STDCXX-1056 DCII [was: Re: STDCXX-1056 [was: Re: STDCXX forks]]

2012-09-22 Thread Liviu Nicoara

On 9/17/12 5:42 PM, Liviu Nicoara wrote:

[...]
However, after more careful thought, I think there is a problem there even
though we don't have an objective proof for it, yet.

The writes are not atomic and they function just like DCII, being subject to
both compiler reordering and out of order execution. E.g., it is assumed that
the writes to the _C_impsize and _C_impdata members occur in the program order,
which would make unguarded reads/tests of _C_impsize safe. This is what the
facet initialization and access code does, conceptually:

if (_C_impsize) {
 // already initialized, use data
}
else {
 // initialize
 mutex.lock ();
 _C_impdata = get_data_from_database ();
 _C_impsize = get_data_size ();
 mutex.unlock ();
}

The mutex lock and unlock operations introduce memory barriers but they are not
guaranteeing the order in which the two writes occur. If the writes would occur
in the exact program order, unguarded reads and tests of _C_impsize would be
safe. Unfortunately, a thread may see the _C_impsize variable being updated
before the _C_impdata.


Elaborating on the above -- facet.cpp:~250, the code is actually closer to this:

if (_C_impsize) {
 // already initialized, use _C_impdata
}
else {
 // initialize
 mutex.lock ();

 if (_C_impsize)
 return _C_impdata;

 void* const pv = ...;
 size_t sz = ...;

 _C_impdata = pv;   // 1
 _C_impsize = sz;   // 2

 mutex.unlock ();
}

(1) and (2) can be reordered. Also, when mapping the STDCXX locale database, 
facet.cpp:288, there is another defect writing the two variables:


   _C_impdata = __rw_get_facet_data (cat, _C_impsize, 0, codeset);
  ^^^
  |||
Set early in __rw_get_facet_data -+++

The only way to order the above writes, preserve the fast checking for 
initialization, and generally salvaging the current algorithm is a full memory 
barrier in between the writes, e.g.:


if (_C_impsize) {
 // already initialized, use data
}
else {
 // ...

 _C_impdata = pv;
 full_membar ();
 _C_impsize = sz;

 // ...
}

Atomics are part of C++11, I wonder if anybody here has working experience with 
the atomic API. A mutex locking will not do because the lock is only acquiring 
and the write before the lock can shift down.


Of course I could be wrong. Any input is welcome.

Thanks,
Liviu


STDCXX-1068 and alignment

2012-09-22 Thread Liviu Nicoara
Stefan, could you please elaborate on the misaligned reads/writes that you 
observed on those platforms? What was failing?


Thanks,
Liviu


Re: STDCXX-1056 : numpunct fix

2012-09-21 Thread Liviu Nicoara

On 09/21/12 05:13, Stefan Teleman wrote:

On Fri, Sep 21, 2012 at 2:28 AM, Travis Vitek
travis.vi...@roguewave.com wrote:

I have provided this list with test results showing that my patch
*does* fix the race condition problems identified by all the tools at
my disposal. I'm willing to bet you never looked at it. You dismissed
it outright, just because you didn't like the *idea* that increasing
the size of the cache, and eliminating that useless cache resizing
would play an important role in eliminating the race conditions.



I looked at it in great detail and I am sure Travis read it too. The 
facet/locale buffer management is a red herring. Apparently, we cannot convince 
you, regardless of the argument. That's fine by me.




I have yet to see an alternative patch being proposed here, which
would eliminate the race conditions, and which I am willing to test
just as thoroughly as I've tested mine. The only thing i've seen are
continued attempts at dismissing the existence of these race
conditions, as  reported by the instrumentation tools, based on some
general axioms about their accuracy. No-one on this list has a clue as
to how the SunPro Thread Analyzer actually works, because it's not
open source, and none of you work at Oracle, therefore you can't see
the code. But everyone just *knows* that it's inaccurate, or that it
reports false positives.


We are not dismissing the analysis. We are dismissing your interpretation of 
it. The tool is not indicating a defect, but a potential defect. I am calling 
it potential because, as my test case (http://tinyurl.com/97ks9gc) indicates, 
one can have a race condition which is not a defect in a concurrent environment.



As long as you, or anyone else, continues to blame the instrumentation
tools for the bug, and as long as anyone here continues to suggest
that the only acceptable proof of the existence of this bug is some
other program which needs to be written using fprintf(3C), and as long
as no-one here is willing to provide an alternative patch which
demonstrably eliminates 100% of the reported race conditions, this
entire back-and-forth about the existence of these race conditions,
the accuracy of the tools and what they are reporting is nothing but a
giant waste of time.



The correction of 1056 does not necessarily need to eliminate the analyzer 
warnings. Aiming for that is simply unreasonable, not because it is not 
attainable (after all, you demonstrated it), but because it is inappropriately 
and insufficiently argued for the situation at hand.

Liviu


SUNPro 5.12 compilation failure in optimized Linux builds

2012-09-21 Thread Liviu Nicoara

Optimized (but not debug) builds fail to compile setlocale.cpp with the error:

$ cat t.cpp; CC -c t.cpp
#define _XOPEN_SOURCE
#include cwchar
/opt/sunpro/12_3/prod/include/cc/wchar.h, line 90: Error: tm is not defined.
/opt/sunpro/12_3/prod/include/cc/wchar.h, line 92: Error: fgetwc is not 
defined.
...

and so on for every wide-char function in wchar.h.

The definition of _XOPEN_SOURCE at the top of setlocale.cpp is not entirely 
correct in that it interacts with the various guards in the C library headers. 
AFAICT, Oracle/Sun wchar.h includes C library wchar.h, which includes C library 
wctype.h, which includes again wchar.h, coming full circle to the compiler 
wchar.h which uses the names w/o them being defined, yet. The fact that 
Oracle/Sun headers include_next corresponding C library headers outside of a 
inclusion guard does not help.


(Debug builds are not affected because of interaction from the difference in the 
structure and includes of rw/_traits.h in debug vs. optimized builds).


The poor man's fix is to guard that _XOPEN_SOURCE define for SUNPro builds (see 
below). However, I am not sure whose side the error is.


Thanks,
Liviu

PS. Is it still called SUNPro? Oracles seems it has sanitized that name out on 
their website.


Index: src/setlocale.cpp
===
--- src/setlocale.cpp   (revision 1388733)
+++ src/setlocale.cpp   (working copy)
@@ -34,8 +34,10 @@
 #include rw/_defs.h

 #if defined (__linux__)  !defined (_XOPEN_SOURCE)
+#  if !defined (__SUNPRO_CC)
// need S_IFDIR on Linux
 #  define _XOPEN_SOURCE
+#  endif // !__SUNPRO_CC
 #endif   // __linux__  !_XOPEN_SOURCE

 #include locale.h   // for setlocale()


Re: STDCXX-1056 : numpunct fix

2012-09-20 Thread Liviu Nicoara
Thanks for the feed-back. Please see below.


On Sep 19, 2012, at 10:02 PM, Stefan Teleman wrote:

 On Wed, Sep 19, 2012 at 8:51 PM, Liviu Nicoara nikko...@hates.ms wrote:
 
 I think you are referring to `live' cache objects and the code which
 specifically adjusts the size of the buffer according to the number of
 `live' locales and/or facets in it. In that respect I would not call that
 eviction because locales and facets with non-zero reference counters are
 never evicted.
 
 But anyhoo, this is semantics. Bottom line is the locale/facet buffer
 management code follows a principle of economy.
 
 Yes it does. But we have to choose between economy and efficiency. To
 clarify: The overhead of having unused pointers in the cache is
 sizeof(void*) times the number of unused slots.  This is 2012. Even
 an entry-level Android cell phone comes with 1GB system memory. If we
 want to talk about embedded systems, where memory constraints are more
 stringent than cell phones, then we're not talking about Apache stdcxx
 anymore, or any other open souce of the C++ Standard Library. These
 types of systems use C++ for embedded systems, which is a different
 animal altogether: no exceptions support, no rtti. For example see,
 Green Hills: http://www.ghs.com/ec++.html. And even they have become
 more relaxed about memory constraints. They use BOOST.
 
 Bottom line: so what if 16 pointers in this 32 pointer slots cache
 never get used. The maximum amount of wasted memory for these 16
 pointers is 128 bytes, on a 64-bit machine with 8-byte sized pointers.
 Can we live with that in 2012, a year when a $500 laptop comes with
 4GB RAM out of the box? I would pick 128 bytes of allocated but unused
 memory over random and entirely avoidable memory churn any day.


The argument is plausible and fine as far as brainstorming goes. 

But have you measured the amount of memory consumed by all STDCXX locale data 
loaded in one process? How much absolute time is spent in resizing the locale 
and facet buffers? What is the gain in space and time performance with such a 
change versus without? Just how fragmented the heap becomes and is there a 
performance impact because of it, etc.? IOW, before changing the status quo one 
must show an objective defect, produce a body of evidence, including a failing 
test case for the argument.


 
 My goal: I would be very happy if any application using Apache stdcxx
 would reach its peak instantiation level of localization (read: max
 number of locales and facets instantiated and cached, for the
 application's particular use case), and would then stabilize at that
 level *without* having to resize and re-sort the cache, *ever*. That
 is a locale cache I can love. I love binary searches on sorted
 containers. Wrecking the container with insertions or deletions, and
 then having to re-sort it again, not so much. Especially when I can't
 figure out why we're doing it in the first place.


And I love minimalistic code, and hate waste at the same time, especially in a 
general purpose library. To each its own.


 
 Hey Stefan, are the above also timing the changes?
 
 Nah, I didn't bother with the timings - yet - for a very simple
 reason: in order to use instrumentation, both with SunPro and with
 Intel compilers, optimization of any kind must be disabled. On SunPro
 you have to pass -xkeepframe=%all (which disables tail-call
 optimization as well), in addition to passing -xO0 and -g. So the
 timings for these unoptimized experiments would have been completely
 irrelevant.

Well, I think you are the only one around here with access to SPARC hardware, 
your input is very precious in this sense. Also, this is the reason for which I 
kept asking that question earlier: do we have currently any failing locale MT 
test when numpunct does just perfect forwarding, with no caching? I.e., 
changing just _numpunct.h and no other source file (as to silence thread 
analyzers warnings) does any locale (or other) MT tests fail? I would greatly 
appreciate it if you could give it a run on your hardware if you don't already 
know the answer.

The discussion has been productive. But I object to the patch as is because it 
goes out of the scope of the original incident. I think this patch should only 
touch the MT defect detected by the failing test cases. If you think the other 
parts you changed are defects you should open corresponding issues in JIRA and 
have them discussed in their separate rooms.

Thanks,
Liviu

Re: STDCXX-1056 : numpunct fix

2012-09-20 Thread Liviu Nicoara

On 09/20/12 13:11, Stefan Teleman wrote:

On Thu, Sep 20, 2012 at 8:07 AM, Liviu Nicoara nikko...@hates.ms wrote:

But have you measured the amount of memory consumed by all STDCXX locale data 
loaded in one process? How much absolute time is spent in resizing the locale 
and facet buffers? What is the gain in space and time performance with such a 
change versus without? Just how fragmented the heap becomes and is there a 
performance impact because of it, etc.? IOW, before changing the status quo one 
must show an objective defect, produce a body of evidence, including a failing 
test case for the argument.


sizeof(std::locale) * number_of_locales.


I was more interested in the underlying locale data, not the C++ objects.



I'll let you in on a little secret: once you call setlocale(3C) and
localeconv(3C), the Standard C Library doesn't release its own locale
handles until process termination. So you might think you save a lot
of memory by destroying and constructing the same locales. You're
really not. It's the Standard C Library locale data which takes up a
lot of space.


Thanks for the secret, I appreciate it. Did you mean to say that the C Standard 
mandates that?!



What I do know for a fact that this optimization did, was to cause
the races conditions reported by 4 different thread analyzers. Race
conditions are a show-stopper for me, and they are not negotiable.


No, that optimization was not causing the MT defect you originally noted. 
Saying so only shows a lack of familiarity with the implementation.




And I love minimalistic code, and hate waste at the same time, especially in a 
general purpose library. To each its own.


Here's a helpful quote:

We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil. It's from Donald
Knuth.


Please, no.



And I love correct code which doesn't cause thread analyzers to report
more than 12000 race conditions for just one test case. I've said it
before and I will say it again: race conditions are a showstopper and
are not negotiable. Period.



Stefan, you are not being correct by a consensus of thread analyzers output or 
by being loud, or by pounding your fist on the table. This being said I will 
continue to exercise due diligence and put in the necessary time and effort to 
provide you with the most useful feed-back I can.

I see that you missed my question in the post before: did you see a failure in 
the locale MT tests in your SPARC runs? If you do not want to run that test, 
that is fine, just let me know.

Thanks,
Liviu




Re: STDCXX-1056 : numpunct fix

2012-09-20 Thread Liviu Nicoara

On Sep 20, 2012, at 5:31 PM, Stefan Teleman wrote:

 On Thu, Sep 20, 2012 at 5:07 PM, Liviu Nicoara nikko...@hates.ms wrote:

 
 To answer your question [...]:
 yes, the MT failures occur on SPARC as well, on both SPARCV8 and
 SPARCV9, and the race conditions are reported on *ALL* plaforms
 tested, even after having applied your _numpunct.h patch. This patch
 alone does *NOT* solve the problem.

Stefan, I want to be clear. You are talking about a patch identical in nature 
to the one I have attached now. Just want to be 100% sure we are talking about 
the same thing. This one still produces failures (crashes, assertions, etc.) in 
the locale MT tests on SPARC and elsewhere in your builds?

Thanks,
Liviu




Re: STDCXX-1056 : numpunct fix

2012-09-20 Thread Liviu Nicoara

On Sep 20, 2012, at 7:37 PM, Stefan Teleman wrote:

 On Thu, Sep 20, 2012 at 7:34 PM, Wojciech Meyer
 wojciech.me...@googlemail.com wrote:
 Hi,
 
 My perceptions is by reading through the whole thread - we should not
 trust 100% external tools to asses the safety of the code. I don't think
 there exist an algorithm that produces no false positives.
 
 That's said I admire Stefan's approach, but we should ask the question
 are we MT safe enough? I would say from what I read here: yes.
 
 Based on what objective metric?


The only gold currency that anyone in here accepts without reservations are 
failing test cases. I believe I have seen some exceptions to the golden rule in 
my RW time, but I can't recall any specific instance.

Liviu 



Re: STDCXX-1056 : numpunct fix

2012-09-20 Thread Liviu Nicoara

On Sep 20, 2012, at 5:23 PM, Stefan Teleman wrote:

 On Thu, Sep 20, 2012 at 4:45 PM, Travis Vitek
 travis.vi...@roguewave.com wrote:
 
 
 I'll let you in on a little secret: once you call setlocale(3C) and
 localeconv(3C), the Standard C Library doesn't release its own locale
 handles until process termination. So you might think you save a lot
 of memory by destroying and constructing the same locales. You're
 really not. It's the Standard C Library locale data which takes up a
 lot of space.
 
 You have a working knowledge of all Standard C Library implementations?
 
 I happen to do, yes, for the operating systems that I've been testing
 on. I also happen to know that you don't. This fact alone pretty much
 closes up *this* particular discussion.
 
 Do yourself, and this mailing list a favor: either write a patch which
 addresses all of your concerns *AND* eliminates all the race
 conditions reported, or stop this pseudo software engineering bullshit
 via email.
 
 There is apparently, a high concentration of know-it-alls on this
 mailing list, who are much better at detecting race conditions and
 thread unsafety than the tools themselves. Too bad they aren't as good
 at figuring out their own bugs.


The sniping is uncalled for. There are no enemies here, no one is after you. 
There is criticism though and you are expected to take it and argue your point 
of view. If you can't stand the heat, get out of the kitchen.


 
 It took eight months for anyone here to even *acknowledge* that
 numpunct and moneypunct do have, in fact, a thread safety problem.
 Never mind that the test case for these facets had been crashing for 4
 years. To be quite blunt and to the point, after 8 months of denying
 obvious facts, your credibility is quite a bit under question at this
 point.


Yes, we are busy with other stuff. I wish I got paid to work on this instead.


 
 This entire discussion has become a perfect illustration with what's
 wrong with the ASF, as reported here:
 
 http://www.mikealrogers.com/posts/apache-considered-harmful.html


I actually read it. I see a guy complaining he can't have it his way. No 
problem. One can fork this project at any time and start it anew, by 
themselves, or in the company of like programmers elsewhere. 

For better or worse Apache got STDCXX from RogueWave. Complaining about it is 
like complaining that Apple doesn't give us iPhones for free; after all we are 
the power users and we know what to do with them.

L

Re: STDCXX-1056 : numpunct fix

2012-09-20 Thread Liviu Nicoara

On Sep 20, 2012, at 7:45 PM, Stefan Teleman wrote:

 On Thu, Sep 20, 2012 at 7:22 PM, Liviu Nicoara nikko...@hates.ms wrote:
 
 Stefan, I want to be clear. You are talking about a patch identical in 
 nature to the one I have attached now. Just want to be 100% sure we are 
 talking about the same thing. This one still produces failures (crashes, 
 assertions, etc.) in the locale MT tests on SPARC and elsewhere in your 
 builds?
 
 On September 17, 2012 I have posted the following message to this list:
 
 http://www.mail-archive.com/dev@stdcxx.apache.org/msg01929.html
 
 In that message, there is a link to my SPARC thread-safety test results:
 
 http://s247136804.onlinehome.us/stdcxx-1056-SPARC-20120917/22.locale.numpunct.mt.nts.1.er.html/index.html
 
 This test was run with the following _numpunct.h file:
 
 http://s247136804.onlinehome.us/stdcxx-1056-SPARC-20120917/22.locale.numpunct.mt.nts.1.er.html/file.14.src.txt.html
 
 The test above shows 12440 race conditions detected for a test run of
 22.locale.numpunct.mt, with  --nthreads=8 --nloops=1.
 
 Did you ever look at these test results? From reading your email, I
 realize that you never looked at it. That is the only possible
 explanation as to why you're asking now for SPARC test results, today
 being September 20, 2012.


I see, there is a confusion about this. Probably nobody explained it before. A 
failing test case means a test case that causes the abnormal termination of the 
execution of the program or creates evidence of abnormal data in the program 
execution.

In this respect please see the atomic add and exchange tests as classical 
examples of what I mean.

I have read all your emails in detail and I have inspected all your 
attachments, modulo the ones I could not open.

Thanks,
Liviu




Re: STDCXX-1056 : numpunct fix

2012-09-20 Thread Liviu Nicoara

On Sep 20, 2012, at 8:02 PM, Stefan Teleman wrote:

 On Thu, Sep 20, 2012 at 7:52 PM, Liviu Nicoara nikko...@hates.ms wrote:
 
 On Sep 20, 2012, at 7:49 PM, Stefan Teleman wrote:
 
 On Thu, Sep 20, 2012 at 7:40 PM, Liviu Nicoara nikko...@hates.ms wrote:
 
 The only gold currency that anyone in here accepts without reservations 
 are failing test cases. I believe I have seen some exceptions to the 
 golden rule in my RW time, but I can't recall any specific instance.
 
 That may be a valid metric here.
 
 The only one. Any programmer worth his salt -- I am borrowing your words 
 here -- would be able to demonstrate the validity of his point of view with 
 a test case.
 
 I did. There are 12440 race conditions detected for an incomplete run
 of 22.locale.numpunct.mt. By incomplete I mean: it did not run with
 its default nthreads and nloops which I believe are 8 threads and
 20 loop iterations.


That is not it, and you did not. Please pay attention: given your assertion 
that a race condition is a defect that causes an abnormal execution of the 
program during which the program sees abnormal, incorrect states (read: 
variable values) it should be easy for you to craft a program that shows 
evidence of that by either printing those values, or aborting upon detecting 
them, etc.

 
 [...] and overall just email bullshit.

Stop using that word. 

L

Re: STDCXX-1056 : numpunct fix

2012-09-20 Thread Liviu Nicoara

On Sep 20, 2012, at 8:59 PM, Stefan Teleman wrote:

 On Thu, Sep 20, 2012 at 8:44 PM, C. Bergström
 cbergst...@pathscale.com wrote:
 
 
 fencepost comment - The results are based on tools and I don't think he has
 a large program which actually triggers the conditions.  (Creating one may
 take quite some time)
 
 I do have a program which triggers the race conditions:
 22.locale.numpunct.mt. Part of the official test harness.
 
 The real reason why they don't want to accept what the instrumentation
 tools are saying is very simple: they don't LIKE reading what the
 tools are saying. So, blame the tools, pretend that as long as it
 doesn't crash again there's no bug and hope for the best.

I cannot include an analyzer output as a regression test in the test suite. 

 
 But I am very glad this is on a public mailing list, so everyone can
 read what's going on here.
 

?

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-18 Thread Liviu Nicoara

On 09/18/12 08:55, Stefan Teleman wrote:

On Mon, Sep 17, 2012 at 11:17 AM, Liviu Nicoara nikko...@hates.ms wrote:

I hope you agree that this synchronization is sufficient for the facet
initialization and reading of facet data.


I have reduced the number of reported race conditions in
22.locale.numpunct.mt from 12440:



I am attaching a test program which, while 100% MT-safe, is flagged by the 
Solaris thread analyzer. The test program contains, conceptually, the exact 
same of shared variable access like the locale management code, the facet 
management code and the facet data management code. This proves, if there was a 
need, that the tool results are to be further analyzed and interpreted but not 
taken at literal value.

The fix we are looking for is one which corrects the initial MT failure 
observed in caching the facet data, as well as it preserves the unguarded reads 
of facet data for performance reasons.

Thanks.

Liviu



http://s247136804.onlinehome.us/stdcxx-1056-SPARC-20120917/22.locale.numpunct.mt.nts.1.er.html/index.html

to 288:

http://s247136804.onlinehome.us/stdcxx-1056-20120918/22.locale.numpunct.mt.5.er.html/index.html

The changes are in the following files:

http://s247136804.onlinehome.us/stdcxx-1056-20120918/facet.cpp
http://s247136804.onlinehome.us/stdcxx-1056-20120918/punct.cpp

_numpunct.h looks like this:

http://s247136804.onlinehome.us/stdcxx-1056-20120918/_numpunct.h

With these changes, no races conditions are repoted for any of the
functions in std::numpunctT.

Still, there are 288 race conditions being reported in
__rw_locale::__rw_locale and in std::locale::_C_get_facet. We need to
identify the source and cause of these race conditions and correct
them as well.

This is not a complete solution to the problem, because we still have
to re-write the chunk of code I eliminated from facet.cpp. It is only
step one towards finding a real solution. But, at least for now, we
have pinpointed where the source of these race conditions is located,
and what causing it.

The test program was run as: ./22.locale.numpunct.mt --nthreads=8
--nloops=1.

More to follow.

--Stefan

#include stdio.h
#include stdlib.h
#include pthread.h
#include unistd.h

#define MAX_THREADS 128

static long counter = 0, nloops = 1000, nthreads = 16;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

extern C {

static void* 
f (void* pv)
{
for (size_t i = 0; i  nloops; ++i) {
if (counter == 0) {
pthread_mutex_lock (lock);
if (counter == 0)
++counter;
pthread_mutex_unlock (lock);
}
else {
// counter value is safe to use here
}
}

printf (%ld\n, counter);

return 0;
}

}

int
main (int argc, char** argv)
{
switch (argc) {
case 3:
nloops = atol (argv [2]);
case 2:
nthreads = atol (argv [1]);
break;
}

pthread_t tid [MAX_THREADS] = { 0 };

if (nthreads  MAX_THREADS)
nthreads = MAX_THREADS;

for (int i = 0; i  nthreads; ++i) {
if (pthread_create (tid + i, 0, f, 0))
exit (-1);
}

for (int i = 0; i  nthreads; ++i) {
if (tid [i])
pthread_join (tid [i], 0);
}

return 0;
}


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-18 Thread Liviu Nicoara

On 09/18/12 13:21, Stefan Teleman wrote:

On Tue, Sep 18, 2012 at 12:43 PM, Liviu Nicoara nikko...@hates.ms wrote:


I am attaching a test program which, while 100% MT-safe, is flagged by
the Solaris thread analyzer.


The program as written is not thread safe. It is reading the value of
the counter variable and performing a zero comparison outside of a
mutex lock:


Stefan, I urge you to consider the argument on its merits. Yes, the thread 
analyzer flags it, but it is nonetheless MT-safe. Specifically:

1. writes are properly synchronized wrt themselves
2. reads are inherently thread-safe wrt themselves
3. reads are properly synchronized wrt writes
4.no thread can possibly observe an intermediate or otherwise incomplete value.

I will also add that the flag is either 0 or 1 during the execution of the 
program, with only one transition from 0 to 1, performed by one single thread.

I will concede that I might be wrong and I am open to arguments. I would accept 
as a counter-argument this program if you can show a runtime failure. I would 
also accept as argument a scenario under which two threads would see 
inconsistent/incorrect values or write the variable more than once, etc.

Thanks,
Liviu



for (size_t i = 0; i  nloops; ++i) {
 if (counter == 0) {  // --- 
 pthread_mutex_lock (lock);
 if (counter == 0)
 ++counter;
 pthread_mutex_unlock (lock);
 }
 else {
 // counter value is safe to use here
 }
 }



Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-18 Thread Liviu Nicoara

On 09/18/12 18:24, Stefan Teleman wrote:

On Tue, Sep 18, 2012 at 4:35 PM, Liviu Nicoara nikko...@hates.ms wrote:


I will concede that I might be wrong and I am open to arguments. I would
accept as a counter-argument this program if you can show a runtime failure.


The the first read of the counter variable is outside a mutex lock
correct? The read is followed by a 0 comparison, correct?


Correct.



What guarantees that between the read and the comparison the value of
the counter variable hasn't been modified by another thread? The
thread currently doing the comparison cannot guarantee it: it hasn't
locked the mutex. Other threads may be running - actually they
probably are. Another thread may have already acquired the mutex and
incremented the value of counter. Your thread has no way of knowing if
that has happened, because it does not yet have exclusive access to
the counter variable. It will, after it acquires the mutex.


It does not matter if the value is changed concurrently in between the 
reading of it and the actual comparison. The value can only be 0, in 
which case it takes the branch that locks the mutex and the next read 
will absolutely see a fresh value; or it can be 1 in which case it has 
already been initialized and this thread does not pay the penalty of a 
lock anymore.




Where is it reading the variable  from? A register? Is it declared
volatile? L2 cache? L3 cache?


Irrelevant. If the value has not been changed it can be read from either 
the cache or directly from the memory. If the value has been changed 
concurrently in between the reading and the actual comparison, the 
processor sees either a stale value of 0, or the new value 1. It is 
guaranteed that any thread reading the value after the (one and only) 
unlock will see a fresh value.




The program, as you wrote it, implicitly acknowledges that it is not
thread safe. That is the point of the double check: one before the
mutex lock, and one after it.


That is a misstatement of the program intentions. See below.


The point of the first check has nothing
to do with thread-safety, and everything to do with a minor
optimization: if the value stored in variable counter is already not
zero, then there's no point in locking the mutex or performing the
increment.


The first check is indeed an optimization. It is the point of this 
exercise to show that the unguarded reads in the localization library 
are not defects and the code, simplified in my test case, is thread safe 
in exactly the respects I mentioned before: the program only observes 
consistent, correct values (program states) in a concurrent environment.


HTH.

Liviu



Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-18 Thread Liviu Nicoara

On 9/18/12 7:04 PM, Stefan Teleman wrote:

On Tue, Sep 18, 2012 at 6:42 PM, Liviu Nicoara nikko...@hates.ms wrote:


The first check is indeed an optimization. It is the point of this exercise
to show that the unguarded reads in the localization library are not defects
and the code, simplified in my test case, is thread safe in exactly the
respects I mentioned before: the program only observes consistent, correct
values (program states) in a concurrent environment.


The Intel Thread Analyzer shows a data race in the exact same spot the
SunPro Thread Analyzer shows a data race, and in the exact same spot I
have stated earlier that there is a data race.

The results are available here:

http://s247136804.onlinehome.us/liviu-mt-test/22.mt.test.r000ti3.inspxez

As usual, it is an Intel Inspector XE 2003 Analysis Results Blob. It
opens with the Intel Inspector. i don't know what format it is in, or
whether or not it can be read by other means.

So far we have SunPro 12.3/Linux/Intel, SunPro 12.3/Solaris/Intel,
SunPro 12.3/Solaris/SPARC and Intel 2003/Linux/Intel saying that there
is a race condition in your program.

And now I would like to get back to solving the thread-unsafety and
race conditions problems in Apache stdcxx.


Stefan, you cannot ignore others' arguments and embrace a 'no' attitude; the 
tools are not saying what you think they do. They all show the same thing 
because they all do the exact same analysis and we could delve into that, too. I 
sincerely hoped you could come up with some sort of a logical argument that 
would prove my latest test case wrong, or at least show it failing at runtime.


My time is also limited and I have already spent more time than I wanted to 
trying to present a logical argument, down to the simplest test cases. I say we 
defer this to Martin for when he will be back from vacation, next week. For the 
record, because people may not have the energy to read all this back and forth, 
I will summarize here the gist of this thread:


1. The facet data caching is not MT-safe
2. The facet data initialization (STDCXX or system locales) is safe (*)
3. There is no unit test currently showing a failure in (2)
4. Timing results show that caching may be slower than non-caching in MT builds
5. A fix should, ideally, be binary compatible
6. A fix should, ideally, preserve performance or increase it
7. There is one patch, currently attached to the issue, by Stefan
8. Other partial patches are referenced from this thread

Please correct me if I missed anything. The above summary is a good starting 
point for a new thread dealing with this issue.


HTH.

Liviu

(*) When perfectly forwarding, with a theoretical concern I asked in an earlier 
post, directed at Martin.




Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-17 Thread Liviu Nicoara

On 09/17/12 09:51, Stefan Teleman wrote:

On Mon, Sep 17, 2012 at 8:46 AM, Liviu Nicoara nikko...@hates.ms wrote:


In the meantime I would like to stress again that __rw_get_numpunct is
perfectly thread-safe and does not need extra locking for perfect
forwarding.


So, by removing the test for

   if (!(_C_flags  _RW::__rw_gr))

(or any other bitmask for that matter), the functions which were
thread-unsafe - and were exhibiting all the symptoms of a run-time
race condition -, magically became thread-safe?

I have looked *extensively* at the code in __rw_get_numpunct. It is
inherently thread-unsafe.


I mean to say that no extra locking is necessary when the public interface 
forwards and no caching is done.

In more detail: __rw_get_numpunct code is entered upon a call from the 
protected virtual interface. It calls facet _C_data member function which 
either returns objects constructed from previously-initialized POD data (in 
which case no locking is necessary), or it attempts to first initialize the 
facet data from the STDCXX database.

If the latter, the execution goes in the facet data initialization code which 
is appropriately synchronized, see this stack trace:

(gdb) bt
#0  __rw::__rw_facet::_C_get_data (this=0x6e10a0) at 
srcdir/stdcxx/branches/4.2.x/src/facet.cpp:179
#1  0x0043788b in __rw::__rw_facet::_C_data (this=0x6e10a0) at 
srcdir/stdcxx/branches/4.2.x/include/loc/_facet.h:194
#2  0x0044874f in __rw::__rw_get_numpunct (pfacet=0x6e10a0, flags=4) at 
srcdir/stdcxx/branches/4.2.x/src/punct.cpp:80
#3  0x00449a14 in __rw::__rw_get_punct (pfacet=0x6e10a0, flags=4) at 
srcdir/stdcxx/branches/4.2.x/src/punct.cpp:578
#4  0x0041cecf in std::numpunctchar::do_grouping (this=0x6e10a0) at 
srcdir/stdcxx/branches/4.2.x/include/loc/_numpunct.h:99
#5  0x0041bd31 in std::numpunctchar::grouping (this=0x6e10a0) at 
srcdir/stdcxx/branches/4.2.x/include/loc/_numpunct.h:190
#6  0x004036b8 in main (argc=2, argv=0x7fffe018) at 
srcdir/stdcxx/branches/4.2.x/tests/localization/s.cpp:29
(gdb)

When the above fails, the facet data has not been initialized, e.g., when there 
is no STDCXX locale database, or the name of the locale does not refer to a 
locale in STDCXX database. The __rw_get_numpunct function will attempt next to 
use libc and system locales, via the __rw_setlocale class, which again is 
properly synchronized and the scope of that lock extends to cover the facet 
data initialization. See this stack trace:

(gdb) bt
#0  __rw::__rw_setlocale::__rw_setlocale (this=0x7fffdd30, locname=0x6e112a 
en_US.utf8, cat=6, nothrow=0) at 
srcdir/stdcxx/branches/4.2.x/src/setlocale.cpp:84
#1  0x00448974 in __rw::__rw_get_numpunct (pfacet=0x6e10a0, flags=4) at 
srcdir/stdcxx/branches/4.2.x/src/punct.cpp:134
#2  0x00449a14 in __rw::__rw_get_punct (pfacet=0x6e10a0, flags=4) at 
srcdir/stdcxx/branches/4.2.x/src/punct.cpp:578
#3  0x0041cecf in std::numpunctchar::do_grouping (this=0x6e10a0) at 
srcdir/stdcxx/branches/4.2.x/include/loc/_numpunct.h:99
#4  0x0041bd31 in std::numpunctchar::grouping (this=0x6e10a0) at 
srcdir/stdcxx/branches/4.2.x/include/loc/_numpunct.h:190
#5  0x004036b8 in main (argc=2, argv=0x7fffe018) at 
srcdir/stdcxx/branches/4.2.x/tests/localization/s.cpp:29

I hope you agree that this synchronization is sufficient for the facet 
initialization and reading of facet data.

Thanks.

Liviu


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-17 Thread Liviu Nicoara

On 09/17/12 11:21, Stefan Teleman wrote:

On Mon, Sep 17, 2012 at 11:17 AM, Liviu Nicoara nikko...@hates.ms wrote:


I hope you agree that this synchronization is sufficient for the facet
initialization and reading of facet data.


Sorry, I do not agree. Two different thread analyzers from two
different compilers written by two different compiler writers tell me
not to.


Stefan, that is indeed your prerogative. However, please keep in mind that 
tools may be buggy or may have limitations beyond what is advertised. I do not 
ask you to have faith in my analysis, which would be absurd, but to look for 
yourself, exercise due diligence in testing the code and draw your own 
conclusions.

Thanks,
Liviu


Sun fbe assembler manual

2012-09-17 Thread Liviu Nicoara

Hi all,

I need a reference manual for the fbe assembler. I am interested in the syntax 
of the `.type' directive. Do you know where such a manual would be located? The 
Solaris assembler manual I found on Oracle's website does not mention the .type 
directive.

Thanks!

Liviu



Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-16 Thread Liviu Nicoara

On 9/16/12 3:20 AM, Stefan Teleman wrote:

On Sat, Sep 15, 2012 at 4:53 PM, Liviu Nicoara nikko...@hates.ms wrote:


Now, to clear the confusion I created: the timing numbers I posted in the
attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012) showed that a
perfectly forwarding, no caching public interface (exemplified by a changed
grouping) performs better than the current implementation. It was that test
case that I hoped you could time, perhaps on SPARC, in both MT and ST
builds. The t.cpp program is for MT, s.cpp for ST.


I got your patch, and have tested it.


Thanks, Stefan. I looked over it and it seems very similar to, and somewhat more 
detailed than gprof profiling output.


I am going to update the incident shortly with a more detailed timing 
measurements on my side, in the form of a new attachment. Just FYI in case you 
still don't get notifications.


Liviu


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-16 Thread Liviu Nicoara

On 9/16/12 11:21 AM, Liviu Nicoara wrote:

On 9/16/12 3:20 AM, Stefan Teleman wrote:

On Sat, Sep 15, 2012 at 4:53 PM, Liviu Nicoara nikko...@hates.ms wrote:


Now, to clear the confusion I created: the timing numbers I posted in the
attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012) showed that a
perfectly forwarding, no caching public interface (exemplified by a changed
grouping) performs better than the current implementation. It was that test
case that I hoped you could time, perhaps on SPARC, in both MT and ST
builds. The t.cpp program is for MT, s.cpp for ST.


I got your patch, and have tested it.


Thanks, Stefan. I looked over it and it seems very similar to, and somewhat more
detailed than gprof profiling output.

I am going to update the incident shortly with a more detailed timing
measurements on my side, in the form of a new attachment. Just FYI in case you
still don't get notifications.


I have attached a new set of results to the incident, in the form of the 
archive:

http://tinyurl.com/9drzg4e

Please see the content for a description of the library changes (_numpunct.h 
file), the MT test program (t.cpp) and the results collected through two 
separate builds on two different machines (results.txt file).


Thanks.

Liviu


STDCXX-1066 [was: Re: STDCXX forks]

2012-09-15 Thread Liviu Nicoara

On 9/1/12 1:52 PM, Stefan Teleman wrote: On Sat, Sep 1, 2012 at 12:15 PM, 
 I opened yesterday STDCXX-1066:

 https://issues.apache.org/jira/browse/STDCXX-1056

 about the pthread_mutex_t/pthread_cond_t alignment on SPARCV8. I'll
 have patches done this weekend. Achtung: the patchset is very large
 and touches a very large number of files. It's strange that I didn't
 get an email about STDCXX-1066.


Hi Stefan,

I have read through the patches attached to the incident, then I briefly read 
about the SunPro pragma align and pack. Two questions:


1. AFAICT, the use of the packing pragma may interfere with a user's setting of 
the same value. I.e., a user sets the packing in their sources and then, 
directly or not, includes an STDCXX header. It seems to me that in such a 
situation, our setting of the packing value would interfere with the rest of the 
user's translation unit, since there is no way to `restore' the previous packing 
value.


Something along the lines of:

// user source file

#pragma pack (X) // X != 8

#include iostream

struct UserDef
{
// different alignment than X ?
// ...
};

Is my understanding correct?

2. The patches are against 4.2.1, but the change would be binary incompatible 
with the already released 4.2.1 branch. Do you plan to have this fix in 4.3.x?


Thanks.

Liviu


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-15 Thread Liviu Nicoara

On 9/15/12 1:51 PM, Stefan Teleman wrote:

On Sat, Sep 15, 2012 at 9:01 AM, Liviu Nicoara nikko...@hates.ms wrote:


That is funny. What compiler are you using? What does the following test
case return for you?


It's the Intel compiler with the patched stdcxx for  the wrong case
and GCC 4.7.1 + GNU libstdc++ for  the correct case.

GCC + GNU libstdc++ are correct.

The patched facet does not call the protected do_*() virtual functions
from their public counterparts, as it is required to do by the
Standard. Instead, it returns the data mebers directly (the data
members were initialized in the constructor). That is the patch you
proposed, which is indeed much better performing than using a mutex
lock. Unfortunately, in doing so, overriding the virtual functions in
a derived facet type becomes pointless.


Ahh, I see now. You are indeed right, that patch is defective. I was under the 
impression that we were discussing the (later) attachment 
stdcxx-1056-timings.tgz which contains a perfectly forwarding implementation of 
the facet public grouping method. The timings I attached there were the ones I 
thought we were discussing all along.


Now, to clear the confusion I created: the timing numbers I posted in the 
attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012) showed that a 
perfectly forwarding, no caching public interface (exemplified by a changed 
grouping) performs better than the current implementation. It was that test case 
that I hoped you could time, perhaps on SPARC, in both MT and ST builds. The 
t.cpp program is for MT, s.cpp for ST.


Please let me know if this clarifies things. I apologize for the 
misunderstanding.

Liviu


Re: STDCXX-1066 [was: Re: STDCXX forks]

2012-09-15 Thread Liviu Nicoara

On 9/15/12 2:57 PM, Stefan Teleman wrote:

On Sat, Sep 15, 2012 at 1:02 PM, Liviu Nicoara nikko...@hates.ms wrote:


I have read through the patches attached to the incident, then I briefly
read about the SunPro pragma align and pack. Two questions:

1. AFAICT, the use of the packing pragma may interfere with a user's setting
of the same value. I.e., a user sets the packing in their sources and then,
directly or not, includes an STDCXX header. It seems to me that in such a
situation, our setting of the packing value would interfere with the rest of
the user's translation unit, since there is no way to `restore' the previous
packing value.

Something along the lines of:

// user source file

#pragma pack (X) // X != 8

#include iostream

struct UserDef
{
 // different alignment than X ?
 // ...
};

Is my understanding correct?


Yes, you are indeed correct, Sir. :-)

However, for every #pragma pack(8) there should be a corresponding
subsequent #pragma pack(0). If there isn't, that's a patch bug.
#pragma pack(0) restores the default alignment. So, there should be
(there *must* be) no silent packing side-effects in user programs.


Yes, but it restores the default packing, not an arbitrary one, potentially set 
by the user prior to including our headers. Say, the user sets 2, the default is 
4 and we set 8. When we set it to default it goes back to 4, instead of the 
expected 2. Did I get this right?





2. The patches are against 4.2.1, but the change would be binary
incompatible with the already released 4.2.1 branch. Do you plan to have
this fix in 4.3.x?


Yes, definitely for 4.2.x and 4.3.x. I sent them for 4.2.1 just
because that's what I have handy.


I see now. I tried to apply them to 4.2.x and some chunks failed, and I thought 
I was not applying them where they were intended to go.


Thanks!

Liviu


Re: STDCXX-1066 [was: Re: STDCXX forks]

2012-09-15 Thread Liviu Nicoara

On 9/15/12 5:19 PM, Stefan Teleman wrote:

On Sat, Sep 15, 2012 at 4:57 PM, Liviu Nicoara nikko...@hates.ms wrote:


Yes, but it restores the default packing, not an arbitrary one, potentially
set by the user prior to including our headers. Say, the user sets 2, the
default is 4 and we set 8. When we set it to default it goes back to 4,
instead of the expected 2. Did I get this right?


This is true, but leaving some arbirary pragma pack(N) (for N != 0) in
effect for the duration of a program, and expecting it to work sounds
like a very defective programming approach to me. It will certainly
not work on SPARC at all.  if the program needs to pack something in a
certain specific way, it need to do so for that specific something
only. Otherwise the side-effects of globally setting a non-default
packing will destroy the program anyway.


I merely wanted to point out that restoring the default packing is not the same 
as restoring the packing to the previous value in effect.


Given this, I thought about an alternative way of forcing this alignment, e.g., 
via a union, aligned on an appropriate type. I see an advantage here in that 
most of the changes would occur where we define the 'primitive' mutex and 
condition wrappers, saving a few pre-processor conditionals and pragmas along 
the way.


What do you think?

Liviu


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-12 Thread Liviu Nicoara

On 09/12/12 00:21, Stefan Teleman wrote:

On Tue, Sep 11, 2012 at 10:18 PM, Liviu Nicoara nikko...@hates.ms wrote:


AFAICT, there are two cases to consider:

1. Using STDCXX locale database initializes the __rw_punct_t data in the
first, properly synchronized pass through __rw_get_numpunct. All subsequent
calls use the __rw_punct_t data to construct returned objects.
2. Using the C library locales does the same in the first pass, via
setlocale and localeconv, but setlocale synchronization is via a per-process
lock. The facet data, once initialized is used just like above.

I probably missed this in the previous conversation, but did you detect a
race condition in the tests if the facets are simply forwarding to the
private virtual interface? I.e., did you detect that the facet
initialization code is unsafe? I think the facet __rw_punct_t data is safely
initialized in both cases, it's the caching that is done incorrectly.


I originally thought so too, but now I'm having doubts. :-) And I
haven't tracked it down with 100% accuracy yet. I saw today this
comment in src/facet.cpp, line 358:

// a per-process array of facet pointers sufficiently large
// to hold (pointers to) all standard facets for 8 locales
static __rw_facet*  std_facet_buf [__rw_facet::_C_last_type * 8];


The localization code is cleverly dense and low-level. There are a few patterns 
that are used throughout, though. One is the function-local static buffers, 
e.g., the locales buffer and the facets buffer, in locale_body.cpp and 
facet.cpp, respectively. They both start from a reasonable size and are grown 
and shrunk according to the needs, but neither locales nor facets are ever 
evicted from them. See:

locale_body:936  -- reduction of locale buffer
locale_body:1006 -- expansion of locale buffer

facet.cpp:397-- reduction of facet buffer
facet.cpp:462-- expansion of facet buffer

Facets buffer is guarded by a static guard in:

__rw_facet::_C_manage:366

The locales buffer is protected by a static guard in:

__rw_locale::_C_manage:880




[...] this is all investigative stuff for tomorrow. :-)

and I agree with Martin that breaking ABI in a minor release is really
not an option. I'm trying to find the best way of making these facets
thread-safe while inflicting the least horrible performance hit.


We are getting close to the point where we can summarize all these findings and 
gauge an appropriate course of action.



I will run your tests tomorrow and let you know. :-)



That would be very helpful!

Liviu


Re: Board report time

2012-09-11 Thread Liviu Nicoara

On 09/11/12 08:15, Jim Jagielski wrote:

It's time for our report to the board...

what would we like to share?

I see:

  o renewed discussion on health/viability of pmc
  o increased development being done
  o PMC expressing interest in moving to git


This sounds about right. It should also mention that some members expressed 
interest in alternative licensing.

Thanks.

Liviu



Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-11 Thread Liviu Nicoara

On 9/11/12 9:40 PM, Martin Sebor wrote:

On 09/11/2012 04:15 PM, Stefan Teleman wrote:

On Mon, Sep 10, 2012 at 4:24 PM, Stefan Teleman

I think I have something which doesn't break BC - stay tuned because
I'm testing it now.


OK.

So, here's a possible implementation of __rw_get_numpunct() with
minimal locking, which passes the MT tests and does not break ABI:

s247136804.onlinehome.us/stdcxx-1056-20120911/punct.cpp

And the same for include/loc/_numpunct.h:

http://s247136804.onlinehome.us/stdcxx-1056-20120911/_numpunct.h

In _numpunct.h, all the functions perform no checks and no lazy
initialization. They function simply as a pass-through to
__rw_get_numpunct(). std::numpunctT's data members are now dead
varaiables.

The bad: performance is no better than with locking the mutex inside
each of the std::numpunctT::*() functions, and with lazy
instantiation.


I wouldn't expect this to be faster than the original. In fact,
I would expect it to be slower because each call to one of the
public, non-virtual members results in a call to the out-of-line
virtual functions (and another to __rw_get_moneypunct). Avoiding
the overhead of such calls is the main and only reason why the
caching exists.



AFAICT, there are two cases to consider:

1. Using STDCXX locale database initializes the __rw_punct_t data in the first, 
properly synchronized pass through __rw_get_numpunct. All subsequent calls use 
the __rw_punct_t data to construct returned objects.
2. Using the C library locales does the same in the first pass, via setlocale 
and localeconv, but setlocale synchronization is via a per-process lock. The 
facet data, once initialized is used just like above.


I probably missed this in the previous conversation, but did you detect a race 
condition in the tests if the facets are simply forwarding to the private 
virtual interface? I.e., did you detect that the facet initialization code is 
unsafe? I think the facet __rw_punct_t data is safely initialized in both cases, 
it's the caching that is done incorrectly.



I'm afraid unoptimized timings don't tell us much. Neither does
a comparison between two compilers, even on the same OS.

I looked at Liviu's timings today. I was puzzled by the difference
between (1) which, IIUC, is the current implementation (presumably
an optimized, thread-safe build with the same compiler and OS) and
(4), which, again IIUC, is the equivalent of your latest patch here
(again, presumably optimized, thread safe, same compiler/OS). I'm
having trouble envisioning how calling a virtual function to
retrieve the value of grouping can possibly be faster than not
calling it (and simply returning the value cached in the data
member of the facet.



The new results I attached to the issue come from a a bit clearer tests and they 
focus on just two cases: the current implementation vs. a non-caching one; the 
latter just forwards the grouping calls to the protected do_grouping, with _no_ 
other changes to the implementation.


The timing numbers seem to show that MT builds fare far worse with the caching 
than without. Stefan, if you have the time, could you please infirm :) my 
conclusions by timing it on one of your machines?


Thanks,

Liviu



Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-10 Thread Liviu Nicoara

On 09/10/12 15:01, Stefan Teleman wrote:

On Mon, Sep 10, 2012 at 2:21 PM, Liviu Nicoara nikko...@hates.ms wrote:


4. Without caching of grouping values, grouping() delegates always to
do_grouping():

real0m5.668s
user1m11.389s
sys 0m3.952s


FWIW, 22.2.3.1.1 explicitly states that all of the decimal_point(),
grouping(), truename(), falsename() must return their do_*()
counterparts.


Yes, of course. :-) I just meant to say that is _always_ delegating to 
do_grouping(), i.e., does no caching.

Liviu



--Stefan




--
And now I see with eye serene
The very pulse of the machine.


[PATCH] STDCXX-1067 Mac builds

2012-09-09 Thread Liviu Nicoara
The default compiler on recent Apple Macs is LLVM with Clang and gcc C++ 
front-ends. The compiler does not come with a C++ language support library. 
However, gcc Mac builds are fine with GNU stock compilers, modulo the issues for 
which I attach the patch below, for review.


I built successfully and ran the test suite on both Mac and Linux, wigh gcc 
4.5.4 and 4.5.2, respectively.


Thanks.

Liviu


Index: src/x86_64/atomic.s
===
--- src/x86_64/atomic.s (revision 1382343)
+++ src/x86_64/atomic.s (working copy)
@@ -26,8 +26,17 @@
  * 
  **/
 
+#if defined (__MACH__)
+// Mac OS X Mach-O assembler: no .type, power of two alignment
+#  define ALIGN_DIR   .align 4
+#  define TYPE_DIR(ignore,ignore2) 
+#else
+#  define ALIGN_DIR   .align 16
+#  define TYPE_DIR(sym,attr)  .type sym, attr
+#endif // __MACH__
+
 .text
-.align 16
+ALIGN_DIR
 
 /***
  * extern C int8_t __rw_atomic_xchg8 (int8_t *x, int8_t y);
@@ -37,7 +46,7 @@
  **/
 
 .globl __rw_atomic_xchg8
-.type __rw_atomic_xchg8, @function
+TYPE_DIR (__rw_atomic_xchg8, STT_FUNC)
 __rw_atomic_xchg8: /* ; int8_t (int8_t *x, int8_t y)  */
 movq  %rdi, %rcx   /* ; %rcx = x  */
 movb  %sil,  %al   /* ; %al = y   */
@@ -53,7 +62,7 @@
  **/
 
 .globl __rw_atomic_xchg16
-.type __rw_atomic_xchg16, @function
+TYPE_DIR (__rw_atomic_xchg16, STT_FUNC)
 __rw_atomic_xchg16:/* ; int16_t (int16_t *x, int16_t y) */
 movq  %rdi, %rcx   /* ; %rcx = x*/
 movw  %si,  %ax/* ; %ax = y */
@@ -69,7 +78,7 @@
  **/
 
 .globl __rw_atomic_xchg32
-.type __rw_atomic_xchg32, @function
+TYPE_DIR (__rw_atomic_xchg32, STT_FUNC)
 __rw_atomic_xchg32:/* ; int32_t (int32_t *x, int32_t y) */
 movq  %rdi,  %rcx  /* ; %rcx = x*/
 movl  %esi,  %eax  /* ; %eax = y*/
@@ -85,7 +94,7 @@
  **/
 
 .globl __rw_atomic_xchg64
-.type __rw_atomic_xchg64, @function
+TYPE_DIR (__rw_atomic_xchg64, STT_FUNC)
 __rw_atomic_xchg64:/* ; int64_t (int64_t *x, int64_t y) */
 movq  %rdi,  %rcx  /* ; %rcx = x*/
 movq  %rsi,  %rax  /* ; %rax = y*/
@@ -101,7 +110,7 @@
  **/
 
 .globl __rw_atomic_add8
-.type __rw_atomic_add8, @function
+TYPE_DIR (__rw_atomic_add8, STT_FUNC)
 __rw_atomic_add8:  /* ; int8_t (int8_t *dst, int8_t inc) */
 movq   %rdi, %rcx  /* ; %rcx = dst   */
 movl   %esi, %eax  /* ; %eax = inc   */
@@ -123,7 +132,7 @@
  **/
 
  .globl __rw_atomic_add16
-.type __rw_atomic_add16, @function
+TYPE_DIR (__rw_atomic_add16, STT_FUNC)
 __rw_atomic_add16: /* ; int16_t (int16_t *dst, int16_t inc) */
 movq   %rdi, %rcx  /* ; %rcx = dst  */
 movw   %si,  %ax   /* ; %ax = inc   */
@@ -146,7 +155,7 @@
  **/
 
 .globl __rw_atomic_add32
-.type __rw_atomic_add32, @function
+TYPE_DIR (__rw_atomic_add32, STT_FUNC)
 __rw_atomic_add32: /* ; int32_t (int32_t *dst, int32_t inc) */
 movq   %rdi, %rcx  /* ; %rcx = dst  */
 movl   %esi, %edx  /* ; %edx = inc  */
@@ -169,7 +178,7 @@
  **/
 
 .globl __rw_atomic_add64
-.type __rw_atomic_add64, @function
+TYPE_DIR (__rw_atomic_add64, STT_FUNC)
 __rw_atomic_add64: /* ; int64_t (int64_t *dst, int64_t inc) */
 movq   %rdi, %rcx  /* ; %rcx = dst  */
 movq   %rsi, %rdx  /* ; %edx = inc  */
Index: etc/config/gcc.config
===
--- etc/config/gcc.config   (revision 1382343)
+++ etc/config/gcc.config   (working copy)
@@ -40,6 +40,7 @@
 else
 ifeq ($(OSNAME),Darwin)
 OS_MAJOR := $(shell 

Re: [PATCH] STDCXX-1067 Mac builds

2012-09-09 Thread Liviu Nicoara

On 9/9/12 7:07 PM, Wojciech Meyer wrote:

Hi Liviu,

I don't use Mac OS X at all but:

Liviu Nicoara nikko...@hates.ms writes:


The default compiler on recent Apple Macs is LLVM with Clang and gcc
C++ front-ends. The compiler does not come with a C++ language support
library. However, gcc Mac builds are fine with GNU stock compilers,
modulo the issues for which I attach the patch below, for review.


I think it will be a big win to support Clang for the community.


AFAICT, there is only one problem with that, the lack of a complete language 
support library.


Pathscale released a replacement for libsupc++/libgcc_s in the form of the pair 
libcxxrt/libunwind, which provide equivalent functionality on *BSD/Linux, but 
not Darwin.


Perhaps Christopher can shed some light here.

Liviu




Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-07 Thread Liviu Nicoara

On 09/06/12 19:54, Martin Sebor wrote:

I'm not sure how easily we can do that. Almost all of locale
is initialized lazily. Some of the layers might depend on the
facets being initialized lazily as well. This was a deliberate
design choice. One of the constraints was to avoid dynamic
initialization or allocation at startup. [...]


There would be a performance degradation. IMHO, it would be minor and would 
simplify the code considerably.

After finally being able to reproduce the defect with SunPro 12.3 on x86_64 I 
tried to remove the lazy initialization and a (smaller) test case now passes. I 
am attaching the program and the numpunct patch.


Out of curiosity, does it still work if you move the locale into
the thread function? Like this:


I created the test case because I wanted something more specific and with more 
predictable results. Initially, the locale object was created in the thread and 
it was passing.




Also, does the 27.objects test pass with this patch?


Will try. I don't think that we can actually use this patch, I bundled it with 
the test so that people can test the approach right away.



I don't know if we have tests for it but writing to cerr/cout
in a bad_alloc handler should succeed. I.e., this should not
abort:

   try {
   struct rlimit rlim = { };
   getrlimit (RLIMIT_DATA, rlim);
   rlim.rlim_cur = 0;
   setrlimit (RLIMIT_DATA, rlim);

   for ( ; ; ) new int;
   }
   catch (bad_alloc) {
   cout  caught bad alloc:  123  '\n';
   }


I see..

Liviu



Re: A question (or two) of procedure, etc.

2012-09-07 Thread Liviu Nicoara

On 09/06/12 23:00, Martin Sebor wrote:

Every project has certain branch strategy, I'm not sure about this so
maybe Martin can advice. I prefer to develop on trunk and cherry pick
to the other branches avoiding bulk merges (and that's in both
directions).


We've done most work on 4.2.x for historical reasons. I think
a better strategy is to develop, as you suggest, on trunk which
has the least restrictive commit policy, and merge changes out
to the more restrictive branches as appropriate.


If open to voting, I am for trunk first, branches later.

Liviu


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-07 Thread Liviu Nicoara

On 09/06/12 22:58, Stefan Teleman wrote:

On Thu, Sep 6, 2012 at 7:31 PM, Liviu Nicoara nikko...@hates.ms wrote:


There would be a performance degradation. IMHO, it would be minor and would 
simplify the code considerably.

After finally being able to reproduce the defect with SunPro 12.3 on x86_64 I 
tried to remove the lazy initialization and a (smaller) test case now passes. I 
am attaching the program and the numpunct patch.


With your patches, the performance is much much better:


I didn't think about timing it, thanks! The result is somewhat expected without 
the synchronization. However, the full construction-time initialization is 
inappropriate if you take in consideration the delegation from the public API 
to the protected virtual API of the facet class, and the possibility of 
overrides in subclasses. Also, lazy initialization follows the principle of 
economy of effort, and caching increases efficiency. If possible in a robust 
way, I would rather have those two. :-)

Liviu



# INFO (S1) (10 lines):
# TEXT:
# COMPILER: Intel C++, __INTEL_COMPILER = 1210,
__INTEL_COMPILER_BUILD_DATE = 20111011, __EDG_VERSION__ = 403
# ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy
Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15
# FILE: 22.locale.numpunct.mt.cpp
# COMPILED: Sep  6 2012, 20:50:13
# COMMENT: thread safety


# +---+--+--+--+
# | DIAGNOSTIC|  ACTIVE  |   TOTAL  | INACTIVE |
# +---+--+--+--+
# | (S1) INFO |   11 |   11 |   0% |
# | (S2) NOTE |1 |1 |   0% |
# | (S8) ERROR|0 |3 | 100% |
# | (S9) FATAL|0 |1 | 100% |
# +---+--+--+--+
real 1035.05
user 1191.76
sys 63.49

--Stefan




--
And now I see with eye serene
The very pulse of the machine.


Re: A question (or two) of procedure, etc.

2012-09-07 Thread Liviu Nicoara

On 09/07/12 10:54, Martin Sebor wrote:

We should remember that there are a number of Jira issues that
we fixed on 4.2.x but haven't merged out to 4.3.x or trunk. The
idea behind the current process (4.2.x - 4.3.x - trunk) was
to be able to simply merge the branches in bulk, as opposed to
an fix at a time. Unfortunately, we ran into some Subversion
issues that made it a huge pain. IIRC, Travis did it at least
once so he might still remember the details.


That would be very helpful to know.



Going forward, to avoid this mess, I would suggest we make
an effort to commit fixes wherever necessary at the same
time instead of delaying it until some time in the future.


Got it.

Liviu


dbx [was: Re: STDCXX-1056 [was: Re: STDCXX forks]]

2012-09-07 Thread Liviu Nicoara

On 09/07/12 11:58, Stefan Teleman wrote:

On Fri, Sep 7, 2012 at 8:52 AM, Liviu Nicoara nikko...@hates.ms wrote:

On 09/06/12 19:54, Martin Sebor wrote:



Also, does the 27.objects test pass with this patch?



No, it does not. It hangs at the first insertion, line 227. Unfortunately, I
cannot debug it because dbx does not work properly in my installation.


What are the symptoms with dbx mishbehaving? (maybe I can help).


Thanks!

I get this when launching the debugger:

$ dbx -xexec32 t
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading t
Reading ld-linux.so.2
dbx: fetch at 0xf400 failed -- Input/output error
dbx: warning: could not put in breakpoint
dbx: warning: internal handler (-80) made defunct -- could not enable event BPT
dbx: warning: internal handler (-86) made defunct -- could not enable event BPT
...

Then the program just runs from the get go. If I set breakpoints and rerun, it 
ignores them.

Also, long running programs cannot be broken into with C-c:

^Cdbx: warning: Interrupt ignored but forwarded to child.
signal INT in (unknown) at 0xf77cf6c4
0xf77cf6c4: movl %edx,%eax
dbx: warning: 'stop' ignored -- while doing rtld handshake
terminating signal 2 SIGINT
dbx: warning: 'stop' ignored -- while doing rtld handshake
(dbx) where
dbx: program is not active
(dbx)

Liviu


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-06 Thread Liviu Nicoara

On 09/05/12 23:51, Martin Sebor wrote:

On 09/05/2012 09:03 PM, Stefan Teleman wrote:

[...]
Agreed.

But: if the choice is between an implementation which [1] breaks ABI
and [2] performs 20% worse -- even in contrived test cases -- than
another implementation [2] which doesn't break ABI, and performs
better than the first one,  why would we even consider the first one?


Breaking the ABI is not an option (unless we rev the version).
But I'm not sure I understand what you think breaks the ABI.


I think Stefan is referring to adding a mutex member variable to the facet in 
question and breaking binary compatibility. If that is the case I have confused 
things when I suggested exactly that, earlier. A cursory read through the 
__rw_facet source shows that inherits from __rw_synchronized in MT builds, 
therefore each facet carries its own mutex member.

Liviu


We don't need to add a new mutex -- we can use the __rw_facet
member for the locking. Or did you mean something else?

Martin



--Stefan




--
And now I see with eye serene
The very pulse of the machine.


A question (or two) of procedure, etc.

2012-09-06 Thread Liviu Nicoara

What is the latest policy in what regards trivial fixes, e.g., the volatile 
qualifier for the max var in LIMITS.cpp we discussed earlier, etc.? It seems 
excessive to create a bug report for such issues.

Also, IIUC from reading previous discussions, forward and backward binary 
compatible changes go in 4.2.x, followed by merges to 4.3.x and trunk. Am I 
getting this right?

Also, besides the Linux, FreeBSD, Windows, Solaris builds hosted on Apache 
(Jenkins) is anybody building on HP-UX, AIX, etc.?

Thanks.

Liviu



Re: A question (or two) of procedure, etc.

2012-09-06 Thread Liviu Nicoara

On 09/06/12 14:37, Wojciech Meyer wrote:

Liviu Nicoara nikko...@hates.ms writes:


What is the latest policy in what regards trivial fixes, e.g., the
volatile qualifier for the max var in LIMITS.cpp we discussed earlier,
etc.? It seems excessive to create a bug report for such issues.


[...]
So I vote for keeping the changes as lightweight as possible, and avoid
extra bureaucracy if it makes sense. This assumption is based that
developers here trust their selves, and run the tests often. I'm not
subscribed to the commit list here, but if I do - I usually follow
people's changes and assume that developers do read commit lists.


Makes sense. Thanks.



So the general consensus from my experience with other project was: not
sure - ask.

That's my experience, also I don't have full rights to express my
opinion right now about stdcxx.


I sure hope we can have totally open (civilized) discussions going forward. :)




Also, IIUC from reading previous discussions, forward and backward
binary compatible changes go in 4.2.x, followed by merges to 4.3.x and
trunk. Am I getting this right?


Every project has certain branch strategy, I'm not sure about this so
maybe Martin can advice. I prefer to develop on trunk and cherry pick
to the other branches avoiding bulk merges (and that's in both
directions).


Right... I saw some discussions from a while back about active development on 4.2.x 
with integration to other branches as dictated by compatibility (e.g., integrating 
4.2.x - 4.3.x and 4.2.x - 4.2.1), and reintegration to trunk as needed. I am 
not looking to change any such policy just wanna make sure I am not messing something 
up.





Also, besides the Linux, FreeBSD, Windows, Solaris builds hosted on Apache 
(Jenkins) is anybody building on HP-UX, AIX, etc.?


Thanks.

Liviu


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-06 Thread Liviu Nicoara

On Sep 5, 2012, at 4:03 PM, Martin Sebor wrote:

 On 09/05/2012 01:33 PM, Liviu Nicoara wrote:
 On 09/05/12 15:17, Martin Sebor wrote:
 On 09/05/2012 12:40 PM, Liviu Nicoara wrote:
 On 09/05/12 14:09, Stefan Teleman wrote:
 On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote:
 [...]
 OK so I did a little bit of testing, after looking at the *right*
 __rw_guard class. :-)
 
 I changed the std::numpunct class thusly:
 [...]
 
 I am afraid this would be unsafe, too (if I said otherwise earlier I was
 wrong). [...] Thoughts?
 
 You're right, there's still a problem. We didn't get the double
 checked locking quite right.
 
 I wish for simplicity: eliminate the lazy initialization, and fully
 initialize the facet in the constructor. Then we'd need no locking on
 copying its data (std::string takes care of its own copying).
 
 I'm not sure how easily we can do that. Almost all of locale
 is initialized lazily. Some of the layers might depend on the
 facets being initialized lazily as well. This was a deliberate
 design choice. One of the constraints was to avoid dynamic
 initialization or allocation at startup. [...]

There would be a performance degradation. IMHO, it would be minor and would 
simplify the code considerably.

After finally being able to reproduce the defect with SunPro 12.3 on x86_64 I 
tried to remove the lazy initialization and a (smaller) test case now passes. I 
am attaching the program and the numpunct patch. 

Will continue to look into it. Although STDCXX does not have an implementation 
of the atomics library we could probably use the existing, unexposed, atomics 
API to implement a more robust lazy initialization. 

Liviu

$ cat tests/localization/t.cpp; svn diff
#include locale

#include cstdio
#include cstdlib

#include unistd.h

#define MAX_THREADS16
#define MAX_LOOPS1000

static bool hold = true;

extern C {

static void* 
f (void* pv)
{
std::numpunctchar const fac = 
*reinterpret_cast std::numpunctchar*  (pv);

while (hold) ; 

for (int i = 0; i != MAX_LOOPS; ++i) {
std::string const grp = fac.grouping ();
}

return 0;
}

}

int
main (int argc, char** argv)
{
std::locale const loc = std::locale (argv [1]);

std::numpunctchar const fac =
std::use_facetstd::numpunctchar  (loc);

pthread_t tid [MAX_THREADS] = { 0 };

for (int i = 0; i  MAX_THREADS; ++i) {
if (pthread_create (tid + i, 0, f, (void*)fac)) 
exit (-1);
}

sleep (1);

hold = false;

for (int i = 0; i  MAX_THREADS; ++i) {
if (tid [i])
pthread_join (tid [i], 0);
}

return 0;
}

Index: include/loc/_numpunct.h
===
--- include/loc/_numpunct.h (revision 1381575)
+++ include/loc/_numpunct.h (working copy)
@@ -61,24 +61,40 @@ struct numpunct: _RW::__rw_facet
 string_type;
 
 _EXPLICIT numpunct (_RWSTD_SIZE_T __ref = 0)
-: _RW::__rw_facet (__ref), _C_flags (0) { }
+: _RW::__rw_facet (__ref) {
+_C_grouping   = do_grouping ();
+_C_truename   = do_truename ();
+_C_falsename  = do_falsename ();
+_C_decimal_point  = do_decimal_point ();
+_C_thousands_sep  = do_thousands_sep ();
+}
 
 virtual ~numpunct () _RWSTD_ATTRIBUTE_NOTHROW;
 
 // 22.2.3.1.1, p1
-char_type decimal_point () const;
+char_type decimal_point () const {
+return _C_decimal_point;
+}
 
 // 22.2.3.1.1, p2
-char_type thousands_sep () const;
+char_type thousands_sep () const {
+return _C_thousands_sep;
+}
 
 // 22.2.3.1.1, p3
-string grouping () const;
+string grouping () const {
+return _C_grouping;
+}
 
 // 22.2.3.1.1, p4
-string_type truename () const;
+string_type truename () const {
+return _C_truename;
+}
 
 // 22.2.3.1.1, p4
-string_type falsename () const;
+string_type falsename () const {
+return _C_falsename;
+}
 
 static _RW::__rw_facet_id id;
 
@@ -112,15 +128,13 @@ protected:
 
 private:
 
-int _C_flags;   // bitmap of cached data valid flags
-string  _C_grouping;// cached results of virtual members
+string  _C_grouping;
 string_type _C_truename;
 string_type _C_falsename;
 char_type   _C_decimal_point;
 char_type   _C_thousands_sep;
 };
 
-
 #ifndef _RWSTD_NO_SPECIALIZED_FACET_ID
 
 _RWSTD_SPECIALIZED_CLASS
@@ -134,95 +148,6 @@ _RW::__rw_facet_id numpunctwchar_t::id
 #  endif   // _RWSTD_NO_WCHAR_T
 #endif   // _RWSTD_NO_SPECIALIZED_FACET_ID
 
-
-template class _CharT
-inline _TYPENAME numpunct_CharT::char_type
-numpunct_CharT::decimal_point () const
-{
-if (!(_C_flags  _RW::__rw_dp)) {
-
-numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);
-
-// [try to] get the decimal point first (may throw)
-// then set a flag to avoid future

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Liviu Nicoara

On 09/05/12 00:51, Stefan Teleman wrote:

On Tue, Sep 4, 2012 at 10:49 PM, Martin Sebor mse...@gmail.com wrote:


   template class _CharT
   inline string numpunct_CharT::grouping () const
   {
   if (!(_C_flags  _RW::__rw_gr)) {

   numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

  _RWSTD_MT_GUARD (_C_mutex);

  // [try to] get the grouping first (may throw)
   // then set a flag to avoid future initializations
   __self-_C_grouping  = do_grouping ();
   __self-_C_flags|= _RW::__rw_gr;
   }

   return _C_grouping;
   }


That's what I wanted to do originally - use a per-object mutex.


FWIW, it is pretty obvious to me _now_ that these assignments are not MT-safe, 
by themselves and also in the context of the test guarding them.

You're right, access must be synchronized on a per-facet object basis. Since 
the data that needs to be protected on assignment is instance data, the mutex 
used in the guard must be a (yet to be added) instance mutex. That would make 
it binary incompatible and would go in ... 4.3.x?




This works:

template class _CharT
inline string numpunct_CharT::grouping () const
{
 if (!(_C_flags  _RW::__rw_gr)) {

 numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

 _RWSTD_MT_STATIC_GUARD (_Type);


It seems to me that a static guard is excessive. The only thing that needs 
guarding is the actual assignment to facet instance members -- a mutex instance member 
would suffice.



And even so, this is still not thread-safe:

Two different threads [ T1 and T2 ], seeking two different locales
[en_US.UTF-8 and ja_JP.UTF-8], enter std::numpunct_CharT::grouping()
at the same time - because they are running on two different cores.
They both test for

 if (!(_C_flags  _RW::__rw_gr))

and then -- assuming the expression above evaluates to true -- one of
them wins the mutex [T1], and the other one [T2] blocks on the mutex.

When T1 is done and exits the function, the grouping is set to
en_US.UTF-8 and the mutex is released. Now T2 acquires the mutex, and
proceeds to setting grouping to ja-JP.UTF-8. Woe on the caller running
from T1 who now thinks he got en_US.UTF-8, but instead he gets
ja_JP.UTF-8, which was duly set so by T2, but T1 had no clue about it
(remember, the std::string grouping _charT buffer is shared by the
caller from T1 and the caller from T2).


I am pretty sure that in your example, they would operate on two different instances of the facet 
class. The static guard would synchronize their running alright but through two 
different facet instances, copying data from two different places of the locale database. In this 
respect a static guard is excessive.

I would defer it to Martin to validate a course of action but to me it looks 
like the only problem is the concurrent assignment to facet instance member 
variables inside the facet member functions. Which could be easily and nicely 
solved with a plain guard on a mutex ordinary member variable.

Thanks!

Liviu


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Liviu Nicoara

On 09/05/12 14:09, Stefan Teleman wrote:

On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote:
[...]
OK so I did a little bit of testing, after looking at the *right*
__rw_guard class. :-)

I changed the std::numpunct class thusly:
[...]
And then:

template class _CharT
inline string numpunct_CharT::grouping () const
{
 if (!(_C_flags  _RW::__rw_gr)) {

 _RWSTD_MT_GUARD (_C_object_mutex);

 // double-test here to avoid re-writing an already written string
 if (!(_C_flags  _RW::__rw_gr)) {

 numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

 // [try to] get the grouping first (may throw)
 // then set a flag to avoid future initializations
 __self-_C_grouping  = do_grouping ();
 __self-_C_flags|= _RW::__rw_gr;

 }
 }

 return _C_grouping;
}


I am afraid this would be unsafe, too (if I said otherwise earlier I was 
wrong). The compiler might re-arrange the protected assignments, such that 
another thread sees a partially updated object, where the flags are updated and 
the string not. I don't think we're going to get away with this here without 
either a simpler and more inefficient top-level locking, or doing away 
completely with the lazy initialization. Thoughts?

Liviu


  1   2   >