On 10/03/12 11:10, Martin Sebor wrote:
[...]
I was just thinking of a few simple loops along the lines of:

   void* thread_func (void*) {
        for (int i = 0; i < N; ++i)
           test 1: do some simple stuff inline
           test 2: call a virtual function to do the same stuff
           test 3: lock and unlock a mutex and do the same stuff
   }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what "simple stuff" is (eventually, even when
it's getting numpunct::grouping() data).

tl;dr: removing the facet data cache is a priority. All else can be put on the back-burner.

Conflicting test results aside, there is still the issue of the incorrect handling of the cached data in the facet. I don't think there is any disagreement on that point. Given that std::string is moving toward dropping the handle-body implementation, simply getting rid of the cache is a step in the same direction.

I think we should preserve the lock-free read of the facet data as a benign race, but making the race benign is perhaps more complicated than previously suggested.

As a reminder, the core of the facet access and initialization code essentially looks like this (pseudocode-ish):


// facet data accessor
...
    if (0 == _C_impsize) {              // 1
        mutex_lock ();
        if (_C_impsize) {
            mutex_unlock ();
            return _C_data;
        }
        _C_data    = get_facet_data (); // 2
        ??                              // 3
        _C_impsize = 1;                 // 4
        mutex_unlock ();
    }
    ??                                  // 5
    return _C_data;                     // 6
...


with question marks standing in for the missing, necessary fixes. The compiler (and the hardware) must be prevented from reordering both the writes in 2-4 and the reads in 1 and 6. Just for the sake of argument, I can imagine an optimization that reorders the reads in 1 and 6 into:

    register x = _C_data;
    if (_C_impsize)
        return x;

and if the loads execute in that order while another thread completes the initialization in between them, the caller will see _C_impsize set but a stale _C_data.

First, the writes in 2-4 need to execute in program order. This takes both a compiler barrier and a store-store memory barrier to keep the stores ordered; otherwise another thread can observe _C_impsize set before _C_data is.

Then, the reads in 1 and 6 need to be ordered so that _C_data is read after _C_impsize, via a compiler barrier and a load-load memory barrier that preserves the program order of the loads.

Various compilers provide these primitives in various forms, but at the moment we don't have a unified STDCXX API for them.
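For what it's worth, if we could rely on C++11, both barriers fall out of a single atomic flag. A sketch of the accessor along those lines (the names and get_facet_data() are invented for the example, mirroring the pseudocode above):

```cpp
#include <atomic>
#include <mutex>

struct FacetData { int value; };

// hypothetical provider, standing in for get_facet_data() above
static FacetData* get_facet_data () { static FacetData d = { 42 }; return &d; }

static std::atomic<int> impsize (0);   // plays the role of _C_impsize
static FacetData*       facet = 0;     // plays the role of _C_data
static std::mutex       mtx;

FacetData* facet_data ()
{
    // acquire load: orders the read of facet (6) after impsize (1)
    if (0 == impsize.load (std::memory_order_acquire)) {    // 1
        std::lock_guard<std::mutex> lock (mtx);
        if (0 == impsize.load (std::memory_order_relaxed)) {
            facet = get_facet_data ();                      // 2
            // release store: orders 4 after the write in 2
            impsize.store (1, std::memory_order_release);   // 3-4
        }
    }
    return facet;                                           // 5-6
}
```

On x86 the acquire load on the fast path compiles to a plain load plus a compiler barrier, so the lock-free read stays essentially free.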

Of course, I might be wrong. Input is appreciated.

Thanks,
Liviu
