On 10/03/12 11:10, Martin Sebor wrote:
[...]
I was just thinking of a few simple loops along the lines of:
void* thread_func (void*) {
for (int i = 0; i < N; ++)
test 1: do some simple stuff inline
test 2: call a virtual function to do the same stuff
test 3: lock and unlock a mutex and do the same stuff
}
Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what "simple stuff" is (eventually, even when
it's getting numpunct::grouping() data).
tl;dr: removing the facet data cache is a priority. All else can be put
on the back-burner.
Conflicting test results aside, there still is the case of the incorrect
handling of the cached data in the facet. I don't think there is a
disagreement on that. Considering that the std::string is moving in the
direction of dropping the handle-body implementation, simply getting rid
of the cache is a step in the same direction.
I think that we should preserve the lock-free reading of the facet data,
as a benign race, but making it benign is perhaps more complicated than
previously suggested.
As a reminder, the core of the facet access and initialization code
essentially looks like this (pseudocode-ish):
// facet data accessor
...
if (0 == _C_impsize) { // 1
mutex_lock ();
if (_C_impsize)
return _C_data;
_C_data = get_facet_data (); // 2
?? // 3
_C_impsize = 1; // 4
mutex_unlock ();
}
?? // 5
return _C_data; // 6
...
with question marks for missing, necessary fixes. The compiler needs to
be prevented from re-ordering both 2-4 and 1-6. Just for the sake of
argument I can imagine an optimization that reorders the reads in 1-6:
register x = _C_data;
if (_C_impsize)
return x;
and if the loads are executed in this order, the caller will see a stale
_C_data.
First, the 2-4 writes need to be executed in the program order. This
needs both a compiler barrier and a store-store memory barrier that will
keep the writes ordered.
Then, the reads in 1-6 need to be ordered such that _C_data is read
after _C_impsize, via a compiler barrier and a load-load memory barrier
that will preserve the program order of the loads.
Various compilers provide these features in various forms, but at the
moment we don't have a unified STDCXX API to implement this.
Of course, I might be wrong. Input is appreciated.
Thanks,
Liviu