Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Liviu Nicoara

On 09/05/12 00:51, Stefan Teleman wrote:

On Tue, Sep 4, 2012 at 10:49 PM, Martin Sebor mse...@gmail.com wrote:


   template class _CharT
   inline string numpunct_CharT::grouping () const
   {
   if (!(_C_flags  _RW::__rw_gr)) {

   numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

  _RWSTD_MT_GUARD (_C_mutex);

  // [try to] get the grouping first (may throw)
   // then set a flag to avoid future initializations
   __self-_C_grouping  = do_grouping ();
   __self-_C_flags|= _RW::__rw_gr;
   }

   return _C_grouping;
   }


That's what I wanted to do originally - use a per-object mutex.


FWIW, it is pretty obvious to me _now_ that these assignments are not MT-safe, 
by themselves and also in the context of the test guarding them.

You're right, access must be synchronized on a per-facet object basis. Since 
the data that needs to be protected on assignment is instance data, the mutex 
used in the guard must be a (yet to be added) instance mutex. That would make 
it binary incompatible and would go in ... 4.3.x?




This works:

template class _CharT
inline string numpunct_CharT::grouping () const
{
 if (!(_C_flags  _RW::__rw_gr)) {

 numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

 _RWSTD_MT_STATIC_GUARD (_Type);


It seems to me that a static guard is excessive. The only thing that needs 
guarding is the actual assignment to facet instance members -- a mutex instance member 
would suffice.



And even so, this is still not thread-safe:

Two different threads [ T1 and T2 ], seeking two different locales
[en_US.UTF-8 and ja_JP.UTF-8], enter std::numpunct_CharT::grouping()
at the same time - because they are running on two different cores.
They both test for

 if (!(_C_flags  _RW::__rw_gr))

and then -- assuming the expression above evaluates to true -- one of
them wins the mutex [T1], and the other one [T2] blocks on the mutex.

When T1 is done and exits the function, the grouping is set to
en_US.UTF-8 and the mutex is released. Now T2 acquires the mutex, and
proceeds to setting grouping to ja-JP.UTF-8. Woe on the caller running
from T1 who now thinks he got en_US.UTF-8, but instead he gets
ja_JP.UTF-8, which was duly set so by T2, but T1 had no clue about it
(remember, the std::string grouping _charT buffer is shared by the
caller from T1 and the caller from T2).


I am pretty sure that in your example, they would operate on two different instances of the facet 
class. The static guard would synchronize their running alright but through two 
different facet instances, copying data from two different places of the locale database. In this 
respect a static guard is excessive.

I would defer it to Martin to validate a course of action but to me it looks 
like the only problem is the concurrent assignment to facet instance member 
variables inside the facet member functions. Which could be easily and nicely 
solved with a plain guard on a mutex ordinary member variable.

Thanks!

Liviu


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Martin Sebor

That's what I wanted to do originally - use a per-object mutext.
Unfortunately the _C_mutex member in rw::__rw_synchronized is static:


That's the wrong __rw_synchronized -- this one is for non-MT builds.
The one we get in MT builds in on line 370.



struct __rw_synchronized
{
 // static so that it takes up no space
 static _RWSTD_EXPORT __rw_mutex _C_mutex;

...

and __rw::rw_guard doesn't have an appropriate constructor.

Intel C++ complains about it too:

/src/steleman/programming/stdcxx-intel/stdcxx-4.2.1-thread-safe/include/loc/_numpunct.h(181):
error: no instance of constructor __rw::__rw_guard::__rw_guard
matches the argument list
 argument types are: (const __rw::__rw_mutex)
   _RWSTD_MT_GUARD (_C_mutex);
   ^


That's because the function is const and the guard takes a non-const
reference. This might be a defect in  __rw_synchronized: the mutex
member should probably be mutable.




...

And even so, this is still not thread-safe:

Two different threads [ T1 and T2 ], seeking two different locales
[en_US.UTF-8 and ja_JP.UTF-8], enter std::numpunct_CharT::grouping()
at the same time - because they are running on two different cores.


Those threads won't have the same numpunct facet. A single facet
is always associated with exactly one locale and the relationship
can never change. The locale class guarantees this.


They both test for

 if (!(_C_flags  _RW::__rw_gr))

and then -- assuming the expression above evaluates to true -- one of
them wins the mutex [T1], and the other one [T2] blocks on the mutex.

When T1 is done and exits the function, the grouping is set to
en_US.UTF-8 and the mutex is released. Now T2 acquires the mutex, and
proceeds to setting grouping to ja-JP.UTF-8. Woe on the caller running
from T1 who now thinks he got en_US.UTF-8, but instead he gets
ja_JP.UTF-8, which was duly set so by T2, but T1 had no clue about it
(remember, the std::string grouping _charT buffer is shared by the
caller from T1 and the caller from T2).

So at a minimum, the locking must happen before evaluating the

 if (!(_C_flags  _RW::__rw_gr))

expression.


You're right that there is a race here. But the race is benign
because the T2 will assign the same value to the string as T1
did (because the grouping must be the same in the same locale).
This doesn't reallocate the string but simply overwrites each
byte with its own value. On all architectures we support this
is atomic and safe.

Martin



This still doesn't solve what ends up being returned in grouping. If
we lock at the top of the function, then, when T2 acquires the mutex,
the test expression will evaluate to false. Therefore T2 will return
whatever is in grouping right now, which happens to be en_US.UTF-8 as
set by T1, when T2 really wanted ja_JP.UTF-8.

I really think the appropriate fix here -- which would address the
performance implications -- is more complex than this. I am thinking
about creating and using a (non-publicly accessible) internal locale
cache:

typedef std::mapstd::string, std::locale  locale_cache;

where all the locales are stored fully initialized, on demand. There
is only one locale instantiation and initialization overhead cost per
locale. After a locale has been instantiated and placed into the
cache, the caller of any specfic locale gets a copy from the cache,
fully instantiated and initialized. But this breaks ABI, so I'm
thinking it's for stdcxx 5.

Thoughts?

--Stefan





RE: Intel C++ bug reports?

2012-09-05 Thread Travis Vitek

On 09/05/12 04:20, Liviu Nicoara wrote:
 On 09/04/12 21:25, Martin Sebor wrote:
 On 09/04/2012 07:02 PM, Liviu Nicoara wrote:
 Hi guys,

 snip

 Looking at the test below, though, it depends on undefined behavior
 (signed overflow) so there's no compiler bug. Making max volatile
 fools icc just enough to produce the expected output (while still
 relying on undefined behavior). It would be good to clean it up,
 though. I think computing UINT_MAX instead and shifting it right
 by the number of sign bits (i.e., 1) should work.
 
 I _know_ it's undefined behavior. :) My case is that Intel is also the
 only compiler failing this test. On that grounds alone they should look
 at it -- I know the gcc guys do when it comes to their compiler. Let
 them shoot it down if they so wish.

Rogue Wave filed an issue with Intel on 2011/04/25 (issue #628095). They shot 
it down.

Travis


Re: Intel C++ bug reports?

2012-09-05 Thread Martin Sebor

On 09/05/2012 05:20 AM, Liviu Nicoara wrote:

On 09/04/12 21:25, Martin Sebor wrote:

On 09/04/2012 07:02 PM, Liviu Nicoara wrote:

Hi guys,

Does any of you know how to go about submitting an Intel compiler bug
without a premier support account?

While configuring the library on my x86_64 machine, I ran into what
appears to be a code generation compiler bug which affects LIMITS.cpp
test -- the test cycles ad infinitum because of the incorrect test
marked below:


I don't know if there's a way to submit it outside Premier but I
have an account and can submit bugs for us.

Looking at the test below, though, it depends on undefined behavior
(signed overflow) so there's no compiler bug. Making max volatile
fools icc just enough to produce the expected output (while still
relying on undefined behavior). It would be good to clean it up,
though. I think computing UINT_MAX instead and shifting it right
by the number of sign bits (i.e., 1) should work.


I _know_ it's undefined behavior. :) My case is that Intel is also the
only compiler failing this test. On that grounds alone they should look
at it -- I know the gcc guys do when it comes to their compiler. Let
them shoot it down if they so wish.


GCC also implements a similar optimization (-fstrict-overflow).
IIRC, we've already tweaked the test to get around it at least
once (by declaring zero, one and two volatile). GCC happens to
miss this case, but ICC doesn't and optimizes the test away,
turning the whole for statement into an infinite loop.

Martin

PS Here's a nice example of a GCC bug filed for the same type
of a problem as we have here:

  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36862



Thanks.

Liviu





Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Stefan Teleman
On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote:

 You're right that there is a race here. But the race is benign
 because the T2 will assign the same value to the string as T1
 did (because the grouping must be the same in the same locale).
 This doesn't reallocate the string but simply overwrites each
 byte with its own value. On all architectures we support this
 is atomic and safe.

OK so I did a little bit of testing, after looking at the *right*
__rw_guard class. :-)

I changed the std::numpunct class thusly:

template class _CharT
class numpunct : public _RW::__rw_facet
{
 public:
 // [ ... snip, no changes here ... ]

 private:

int _C_flags;   // bitmap of cached data valid flags
string  _C_grouping;// cached results of virtual members
string_type _C_truename;
string_type _C_falsename;
char_type   _C_decimal_point;
char_type   _C_thousands_sep;
mutable _RW::__rw_mutex  _C_object_mutex; // -
};

And then:

template class _CharT
inline string numpunct_CharT::grouping () const
{
if (!(_C_flags  _RW::__rw_gr)) {

_RWSTD_MT_GUARD (_C_object_mutex);

// double-test here to avoid re-writing an already written string
if (!(_C_flags  _RW::__rw_gr)) {

numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

// [try to] get the grouping first (may throw)
// then set a flag to avoid future initializations
__self-_C_grouping  = do_grouping ();
__self-_C_flags|= _RW::__rw_gr;

}
}

return _C_grouping;
}

same changes for std::numpuncttruename() and std::numpunct::falsename().

Ran 22.locale.numpunct.mt with the default values for nthreads and nloops.

Yesterday's test results, with the static per-class mutex:

# INFO (S1) (10 lines):
# TEXT:
# COMPILER: Intel C++, __INTEL_COMPILER = 1210,
__INTEL_COMPILER_BUILD_DATE = 20111011, __EDG_VERSION__ = 403
# ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy
Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15
# FILE: 22.locale.numpunct.mt.cpp
# COMPILED:  Sep  4 2012, 09:11:36
# COMMENT: thread safety


# CLAUSE: lib.locale.numpunct

# NOTE (S2) (5 lines):
# TEXT: executing locale -a  /tmp/tmpfile-V7siTq
# CLAUSE: lib.locale.numpunct
# FILE: process.cpp
# LINE: 276

# INFO (S1) (3 lines):
# TEXT: testing std::numpunctcharT with 8 threads, 20 iterations
each, in 32 locales { C aa_DJ aa_DJ.iso88591 aa_DJ.utf8
aa_ER aa_ER@saaho aa_ER.utf8 aa_ER.utf8@saaho aa_ET
aa_ET.utf8 af_ZA af_ZA.iso88591 af_ZA.utf8 am_ET
am_ET.utf8 an_ES an_ES.iso885915 an_ES.utf8 ar_AE
ar_AE.iso88596 ar_AE.utf8 ar_BH ar_BH.iso88596 ar_BH.utf8
ar_DZ ar_DZ.iso88596 ar_DZ.utf8 ar_EG ar_EG.iso88596
ar_EG.utf8 ar_IN ar_IN.utf8 }
# CLAUSE: lib.locale.numpunct

[ ... ]

# +---+--+--+--+
# | DIAGNOSTIC|  ACTIVE  |   TOTAL  | INACTIVE |
# +---+--+--+--+
# | (S1) INFO |   11 |   11 |   0% |
# | (S2) NOTE |1 |1 |   0% |
# | (S8) ERROR|0 |3 | 100% |
# | (S9) FATAL|0 |1 | 100% |
# +---+--+--+--+
real 2139.31
user 2406.09
sys 155.61

===

Today's test results with the per-object mutex (as shown above):

# INFO (S1) (10 lines):
# TEXT:
# COMPILER: Intel C++, __INTEL_COMPILER = 1210,
__INTEL_COMPILER_BUILD_DATE = 20111011, __EDG_VERSION__ = 403
# ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy
Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15
# FILE: 22.locale.numpunct.mt.cpp
# COMPILED: Sep  5 2012, 13:08:03
# COMMENT: thread safety


# CLAUSE: lib.locale.numpunct

# NOTE (S2) (5 lines):
# TEXT: executing locale -a  /tmp/tmpfile-ww0nez
# CLAUSE: lib.locale.numpunct
# FILE: process.cpp
# LINE: 276

# INFO (S1) (3 lines):
# TEXT: testing std::numpunctcharT with 8 threads, 20 iterations
each, in 32 locales { C aa_DJ aa_DJ.iso88591 aa_DJ.utf8
aa_ER aa_ER@saaho aa_ER.utf8 aa_ER.utf8@saaho aa_ET
aa_ET.utf8 af_ZA af_ZA.iso88591 af_ZA.utf8 am_ET
am_ET.utf8 an_ES an_ES.iso885915 an_ES.utf8 ar_AE
ar_AE.iso88596 ar_AE.utf8 ar_BH ar_BH.iso88596 ar_BH.utf8
ar_DZ ar_DZ.iso88596 ar_DZ.utf8 ar_EG ar_EG.iso88596
ar_EG.utf8 ar_IN ar_IN.utf8 }
# CLAUSE: lib.locale.numpunct

[ ... ]

# +---+--+--+--+
# | DIAGNOSTIC|  ACTIVE  |   TOTAL  | INACTIVE |
# +---+--+--+--+
# | (S1) INFO |   11 |   11 |   0% |
# | (S2) NOTE |1 |1 |   0% |
# | (S8) ERROR|0 |3 | 100% |
# | (S9) FATAL|0 |1 

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Liviu Nicoara

On 09/05/12 14:09, Stefan Teleman wrote:

On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote:
[...]
OK so I did a little bit of testing, after looking at the *right*
__rw_guard class. :-)

I changed the std::numpunct class thusly:
[...]
And then:

template class _CharT
inline string numpunct_CharT::grouping () const
{
 if (!(_C_flags  _RW::__rw_gr)) {

 _RWSTD_MT_GUARD (_C_object_mutex);

 // double-test here to avoid re-writing an already written string
 if (!(_C_flags  _RW::__rw_gr)) {

 numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

 // [try to] get the grouping first (may throw)
 // then set a flag to avoid future initializations
 __self-_C_grouping  = do_grouping ();
 __self-_C_flags|= _RW::__rw_gr;

 }
 }

 return _C_grouping;
}


I am afraid this would be unsafe, too (if I said otherwise earlier I was 
wrong). The compiler might re-arrange the protected assignments, such that 
another thread sees a partially updated object, where the flags are updated and 
the string not. I don't think we're going to get away with this here without 
either a simpler and more inefficient top-level locking, or doing away 
completely with the lazy initialization. Thoughts?

Liviu


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Martin Sebor

On 09/05/2012 12:40 PM, Liviu Nicoara wrote:

On 09/05/12 14:09, Stefan Teleman wrote:

On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote:
[...]
OK so I did a little bit of testing, after looking at the *right*
__rw_guard class. :-)

I changed the std::numpunct class thusly:
[...]
And then:

template class _CharT
inline string numpunct_CharT::grouping () const
{
if (!(_C_flags  _RW::__rw_gr)) {

_RWSTD_MT_GUARD (_C_object_mutex);

// double-test here to avoid re-writing an already written string
if (!(_C_flags  _RW::__rw_gr)) {

numpunct* const __self = _RWSTD_CONST_CAST (numpunct*, this);

// [try to] get the grouping first (may throw)
// then set a flag to avoid future initializations
__self-_C_grouping = do_grouping ();
__self-_C_flags |= _RW::__rw_gr;

}
}

return _C_grouping;
}


I am afraid this would be unsafe, too (if I said otherwise earlier I was
wrong). The compiler might re-arrange the protected assignments, such
that another thread sees a partially updated object, where the flags are
updated and the string not. I don't think we're going to get away with
this here without either a simpler and more inefficient top-level
locking, or doing away completely with the lazy initialization. Thoughts?


You're right, there's still a problem. We didn't get the double
checked locking quite right. We need to prevent the reordering
both by the compiler and by the hardware so that the reader
doesn't get an out of date value of _C_grouping. Making the
members volatile and/or moving the body of the first if into
an out-of-line function should help with the compiler reordering
but we'll need a barrier to prevent the hardware from serving us
up stale data (i.e., letting T2 see the updated _C_flags but
a _C_grouping that's still being modified). Calling
pthread_mutex_unlock does insert a barrier, but the barrier
is only executed when both tests pass, and not when the first
one fails. We need the barrier in both cases. This seems like
it would get too ugly (and big) to put in an inline function.

Martin



Liviu




Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Martin Sebor

On 09/05/2012 01:33 PM, Liviu Nicoara wrote:

On 09/05/12 15:17, Martin Sebor wrote:

On 09/05/2012 12:40 PM, Liviu Nicoara wrote:

On 09/05/12 14:09, Stefan Teleman wrote:

On Wed, Sep 5, 2012 at 10:52 AM, Martin Sebor mse...@gmail.com wrote:
[...]
OK so I did a little bit of testing, after looking at the *right*
__rw_guard class. :-)

I changed the std::numpunct class thusly:
[...]


I am afraid this would be unsafe, too (if I said otherwise earlier I was
wrong). [...] Thoughts?


You're right, there's still a problem. We didn't get the double
checked locking quite right.


I wish for simplicity: eliminate the lazy initialization, and fully
initialize the facet in the constructor. Then we'd need no locking on
copying its data (std::string takes care of its own copying).


I'm not sure how easily we can do that. Almost all of locale
is initialized lazily. Some of the layers might depend on the
facets being initialized lazily as well. This was a deliberate
design choice. One of the constraints was to avoid dynamic
initialization or allocation at startup. Another was to be
able to use iostreams (to a limited extent) in low memory
conditions to write an error message to stderr. It's been
a long time and the details are more than a little fuzzy.
I'll need to spend some time going through the code and
refreshing my memory.

Martin



Liviu





Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Pavel Heimlich, a.k.a. hajma
2012/9/5 Stefan Teleman stefan.tele...@gmail.com:
 On Tue, Sep 4, 2012 at 9:15 PM, Liviu Nicoara nikko...@hates.ms wrote:

 That is good, right?

 Yes, it is good. :-)


 Could not get an Intel build off the ground on the account of the LIMITS.cpp 
 test not running to completion because of a possible compiler bug. I posted 
 earlier an account of it. Do you have a support account that allows posting 
 bug reports for Intel's C++ compiler?

 Oh, yes I remember that now, it happened to me too. I worked around it
 by copying the LIMITS executable from a gcc build.

 I don't have a support contract with Intel - I'm using the free
 compilers at home. I'm guessing Intel must have an open forum for
 Intel developers using their compilers, so that might be a way to
 report the bug. But I haven't looked for it.



 I did not have access to a Solaris box since I left Rogue Wave. Does 
 Sun/Oracle have any program that would allow one to test drive their 
 compiler on a shared box somewhere? I vaguely remember that HP had something 
 like that a while ago.

 I know Sun had such a program. Oracle I don't really know but my
 educated guess is not a chance. ;-)

Gold and above members of the oracle parter network can* get access to
a lab running various Oracle HW and SW.
For regular persons Stefan's answer applies 100%.

However there's nothing stopping you from downloading Solaris 11 iso
or a virtualbox image from
http://www.oracle.com/technetwork/server-storage/solaris11/downloads/index.html
installing it in a virtualization solution of your choice (virtualbox
works best for me) and installing Studio from
http://www.oracle.com/technetwork/server-storage/solarisstudio/downloads/index.html

As long as the installation is used for sw development and not for
production use you do not need to pay Oracle a dime, you just won't
get any updates/patches. (check the licenses fo details, I'm no
lawyer)

P.

* this was announced in April, I have no idea about the current status


 One way of testing with SunPro is to download the free version of
 SunPro from Oracle, and use it on Linux (yes, SunPro exists for Linux
 as well). It works very well on OpenSUSE and Ubuntu (tested and used
 on both myself, debugged stdcxx on OpenSUSE with SunPro 12.2. Doesn't
 work at all on Fedora 17 because of the TLS glibc bug (however it used
 to on earlier Fedora releases).

 And now for the latest about 1056:

 I managed (after many, MANY tries) to get a run of
 22.locale.numpunct.mt on Solaris SPARC 32-bit build (at work), and
 without the thread-safety patches applied, and I hit the jackpot:

 # INFO (S1) (10 lines):
 # TEXT:
 # COMPILER: SunPro, __SUNPRO_CC = 0x5100
 # ENVIRONMENT: sparc-v8 running sunos-5.11
 # FILE: 22.locale.numpunct.mt.cpp
 # COMPILED: Sep  4 2012, 16:10:03
 # COMMENT: thread safety
 

 # CLAUSE: lib.locale.numpunct

 # NOTE (S2) (5 lines):
 # TEXT: executing /usr/bin/locale -a  /var/tmp/tmpfile-MWa4bQ
 # CLAUSE: lib.locale.numpunct
 # FILE: process.cpp
 # LINE: 276

 # INFO (S1) (3 lines):
 # TEXT: testing std::numpunctcharT with 4 threads, 1000 iterations
 each, in 32 locales { C POSIX af_ZA.UTF-8 ar_AE.UTF-8
 ar_BH.UTF-8 ar_DZ.UTF-8 ar_EG.ISO8859-6 ar_EG.UTF-8
 ar_IQ.UTF-8 ar_JO.UTF-8 ar_KW.UTF-8 ar_LY.UTF-8 ar_MA.UTF-8
 ar_OM.UTF-8 ar_QA.UTF-8 ar_SA.UTF-8 ar_TN.UTF-8 ar_YE.UTF-8
 as_IN.UTF-8 az_AZ.UTF-8 be_BY.UTF-8 bg_BG.ISO8859-5
 bg_BG.UTF-8 bn_IN.UTF-8 bs_BA.ISO8859-2 bs_BA.UTF-8
 ca_ES.ISO8859-1 ca_ES.ISO8859-15 ca_ES.UTF-8 cs_CZ.ISO8859-2
 cs_CZ.UTF-8 cs_CZ.UTF-8@euro }
 # CLAUSE: lib.locale.numpunct

 [ ... snip ... ]

 # INFO (S1) (4 lines):
 # TEXT: creating a thread pool with 4 threads
 # CLAUSE: lib.locale.numpunct
 # LINE: 548

 Abort (core dump)

 /builds/steleman/userland-stdcxx-upstream-thread-safety/components/stdcxx/stdcxx-4.2.1/tests/localization/22.locale.numpunct.mt.cpp:140:
 Assertion '0 == rw_strncmp (grp.c_str (), data.grouping_.c_str ())'
 failed.
 /builds/steleman/userland-stdcxx-upstream-thread-safety/components/stdcxx/stdcxx-4.2.1/build/lib/libstdcxx4.so.4.2.1'__1cE__rwQ__rw_assert_fail6Fpkc2i2_v_+0x78
 [0xff249af8]
 /builds/steleman/userland-stdcxx-upstream-thread-safety/components/stdcxx/stdcxx-4.2.1/build/tests/22.locale.numpunct.mt'thread_func+0x500
 [0x1b050]
 /lib/libc.so.1'_lwp_start+0x0 [0xf7b553f0]

 Bingo!

 At line 140 in 22.locale.numpunct.mt.cpp, we see:

  RW_ASSERT (0 == rw_strncmp (grp.c_str (),
 data.grouping_.c_str ()));

 What this means:

 Between the time std::string grp was initialized at line 133:

 const std::string grp = np.grouping ();

 and the time we hit line 140 (the RW_ASSERT), another thread modified
 the std::string _C_grouping member of class

 template class _CharT class numpunct : public _RW::__rw_facet.

 And because this is the non-patched version of _numpunct.h, the
 variable std::string grp did not really return a deep copy of
 _C_grouping, but a shared 

Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Stefan Teleman
On Wed, Sep 5, 2012 at 4:20 PM, Stefan Teleman stefan.tele...@gmail.com wrote:

 But then there's another aspect  -- which I probably failed to
 highlight in my previous email: the per-object mutex implementation is
 20% *slower* than the class-static mutex implementation.

 class-static implementation:
 real 2139.31
 user 2406.09
 sys 155.61

 pe-object implementation:
 real 2416.75
 user 2694.64
 sys 159.49

The above results are for the Intel compiler. Now here are the results
with GCC 4.7.0:

1. with the static per-class mutex:

# INFO (S1) (10 lines):
# TEXT:
# COMPILER: gcc 4.7.0, __VERSION__ = 4.7.0 20120507 (Red Hat 4.7.0-5)
# ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy
Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15
# FILE: 22.locale.numpunct.mt.cpp
# COMPILED: Sep  5 2012, 06:21:18
# COMMENT: thread safety


[ ... ]

# +---+--+--+--+
# | DIAGNOSTIC|  ACTIVE  |   TOTAL  | INACTIVE |
# +---+--+--+--+
# | (S1) INFO |   11 |   11 |   0% |
# | (S2) NOTE |1 |1 |   0% |
# | (S8) ERROR|0 |3 | 100% |
# | (S9) FATAL|0 |1 | 100% |
# +---+--+--+--+
real 2165.06
user 2428.08
sys 151.30


2. With the per-object mutex:

# INFO (S1) (10 lines):
# TEXT:
# COMPILER: gcc 4.7.0, __VERSION__ = 4.7.0 20120507 (Red Hat 4.7.0-5)
# ENVIRONMENT: pentiumpro running linux-elf (Fedora release 17 (Beefy
Miracle) (3.5.0-2.fc17.x86_64)) with glibc 2.15
# FILE: 22.locale.numpunct.mt.cpp
# COMPILED: Sep  5 2012, 21:29:56
# COMMENT: thread safety


# CLAUSE: lib.locale.numpunct

 [ ... ]

# +---+--+--+--+
# | DIAGNOSTIC|  ACTIVE  |   TOTAL  | INACTIVE |
# +---+--+--+--+
# | (S1) INFO |   11 |   11 |   0% |
# | (S2) NOTE |1 |1 |   0% |
# | (S8) ERROR|0 |3 | 100% |
# | (S9) FATAL|0 |1 | 100% |
# +---+--+--+--+
real 2438.70
user 2726.44
sys 155.79

About the same percentage difference as the Intel compiler.

--Stefan

---
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Stefan Teleman
On Wed, Sep 5, 2012 at 10:55 PM, Martin Sebor mse...@gmail.com wrote:

 I suspect the difference is due to the overhead of the repeated
 initialization and destruction of the per-object mutex in the
 test. The test repeatedly creates (and discards) named locale
 objects.

 The per-class mutex is initialized just once in the process, no
 matter how many facet objects (how many distinct named locales)
 the test creates. But the per-object mutex must be created (and
 destroyed) for each named locale.

Agreed.

But: if the choice is between an implementation which [1] breaks ABI
and [2] performs 20% worse -- even in contrived test cases -- than
another implementation [2] which doesn't break ABI, and performs
better than the first one,  why would we even consider the first one?

--Stefan

-- 
Stefan Teleman
KDE e.V.
stefan.tele...@gmail.com


Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Martin Sebor

On 09/05/2012 09:03 PM, Stefan Teleman wrote:

On Wed, Sep 5, 2012 at 10:55 PM, Martin Sebor mse...@gmail.com wrote:


I suspect the difference is due to the overhead of the repeated
initialization and destruction of the per-object mutex in the
test. The test repeatedly creates (and discards) named locale
objects.

The per-class mutex is initialized just once in the process, no
matter how many facet objects (how many distinct named locales)
the test creates. But the per-object mutex must be created (and
destroyed) for each named locale.


Agreed.

But: if the choice is between an implementation which [1] breaks ABI
and [2] performs 20% worse -- even in contrived test cases -- than
another implementation [2] which doesn't break ABI, and performs
better than the first one,  why would we even consider the first one?


Breaking the ABI is not an option (unless we rev the version).
But I'm not sure I understand what you think breaks the ABI.
We don't need to add a new mutex -- we can use the __rw_facet
member for the locking. Or did you mean something else?

Martin



--Stefan





Re: STDCXX-1056 [was: Re: STDCXX forks]

2012-09-05 Thread Martin Sebor

Incidentally, a benchmark that I would expect to be more
representative of real programs than the test we've been
using is one that creates a small number of named locales
up front and then uses each in a loop in each thread (as
opposed to creating and destroying it in each iteration
and thread). With this setup, the per-object mutex
implementation should be much faster than the one with
the per-class mutex because of the absence of contention.

The test has an option that exercises this setup:
  --shared-locale

Martin

On 09/05/2012 09:51 PM, Martin Sebor wrote:

On 09/05/2012 09:03 PM, Stefan Teleman wrote:

On Wed, Sep 5, 2012 at 10:55 PM, Martin Sebor mse...@gmail.com wrote:


I suspect the difference is due to the overhead of the repeated
initialization and destruction of the per-object mutex in the
test. The test repeatedly creates (and discards) named locale
objects.

The per-class mutex is initialized just once in the process, no
matter how many facet objects (how many distinct named locales)
the test creates. But the per-object mutex must be created (and
destroyed) for each named locale.


Agreed.

But: if the choice is between an implementation which [1] breaks ABI
and [2] performs 20% worse -- even in contrived test cases -- than
another implementation [2] which doesn't break ABI, and performs
better than the first one,  why would we even consider the first one?


Breaking the ABI is not an option (unless we rev the version).
But I'm not sure I understand what you think breaks the ABI.
We don't need to add a new mutex -- we can use the __rw_facet
member for the locking. Or did you mean something else?

Martin



--Stefan