On 9/11/12 9:40 PM, Martin Sebor wrote:
On 09/11/2012 04:15 PM, Stefan Teleman wrote:
On Mon, Sep 10, 2012 at 4:24 PM, Stefan Teleman
I think I have something which doesn't break BC - stay tuned because
I'm testing it now.
OK.
So, here's a possible implementation of __rw_get_numpunct() with
minimal locking, which passes the MT tests and does not break ABI:
http://s247136804.onlinehome.us/stdcxx-1056-20120911/punct.cpp
And the same for include/loc/_numpunct.h:
http://s247136804.onlinehome.us/stdcxx-1056-20120911/_numpunct.h
In _numpunct.h, all the functions perform no checks and no lazy
initialization. They function simply as a pass-through to
__rw_get_numpunct(). std::numpunct<T>'s data members are now dead
variables.
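The pass-through approach described above can be sketched roughly as follows. The struct and function names here are illustrative stand-ins, not the actual stdcxx internals, and the sample punctuation values are assumptions:

```cpp
#include <string>

// Hypothetical stand-in for the internal __rw_punct_t data and its
// synchronized accessor; the real __rw_get_numpunct() reads the
// locale database under the library's own lock.
struct punct_data {
    char decimal_point;
    char thousands_sep;
    std::string grouping;
};

punct_data __rw_get_numpunct_sketch() {
    return punct_data{'.', ',', "\3"};  // assumed sample values
}

// Pass-through facet: no checks, no lazy initialization, no cached
// members -- every public call simply forwards to the accessor.
struct numpunct_passthrough {
    char decimal_point() const { return __rw_get_numpunct_sketch().decimal_point; }
    char thousands_sep() const { return __rw_get_numpunct_sketch().thousands_sep; }
    std::string grouping() const { return __rw_get_numpunct_sketch().grouping; }
};
```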
The bad: performance is no better than with locking the mutex inside
each of the std::numpunct<T>::*() functions, and with lazy
instantiation.
I wouldn't expect this to be faster than the original. In fact,
I would expect it to be slower because each call to one of the
public, non-virtual members results in a call to the out-of-line
virtual functions (and another to __rw_get_moneypunct). Avoiding
the overhead of such calls is the main and only reason why the
caching exists.
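The trade-off Martin describes can be seen in a simplified form. The two structs below are illustrative sketches, not the stdcxx code: the cached variant returns a data member directly, while the forwarding variant pays a virtual dispatch (and, in the real library, a call through the accessor) on every invocation:

```cpp
#include <string>

struct numpunct_cached {
    // Filled in once at construction; the public accessor then returns
    // it directly, with no virtual dispatch on the hot path.
    std::string grouping_cache;

    numpunct_cached() : grouping_cache(grouping_impl()) {}

    std::string grouping() const { return grouping_cache; }

    // Non-virtual stand-in for the initialization source.
    static std::string grouping_impl() { return "\3"; }
};

struct numpunct_forwarding {
    // Every call goes through the out-of-line virtual function.
    std::string grouping() const { return do_grouping(); }
    virtual std::string do_grouping() const { return "\3"; }
    virtual ~numpunct_forwarding() = default;
};
```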
AFAICT, there are two cases to consider:
1. Using the STDCXX locale database, the __rw_punct_t data is initialized in the
first, properly synchronized pass through __rw_get_numpunct. All subsequent
calls use the __rw_punct_t data to construct the returned objects.
2. Using the C library locales, the same happens in the first pass, via setlocale
and localeconv, but setlocale synchronization is via a per-process lock. The
facet data, once initialized, is used just like above.
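Both cases reduce to the same pattern: one properly synchronized first pass fills the shared data, and every later call only reads it. A minimal sketch of that pattern using std::call_once (the real stdcxx code predates C++11 and uses its own locking; the names and values here are assumptions):

```cpp
#include <mutex>
#include <string>

struct punct_init_data {
    std::string grouping;
};

// Shared facet data, written exactly once.
static punct_init_data g_punct;
static std::once_flag g_punct_once;

const punct_init_data& get_punct_once() {
    // First pass: synchronized initialization (the stdcxx database or
    // the setlocale/localeconv path described above); all later passes
    // are plain reads of already-initialized data.
    std::call_once(g_punct_once, [] { g_punct.grouping = "\3"; });
    return g_punct;
}
```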
I probably missed this in the previous conversation, but did you detect a race
condition in the tests when the facets simply forward to the private virtual
interface? I.e., did you detect that the facet initialization code itself is
unsafe? I think the facet __rw_punct_t data is safely initialized in both cases;
it's the caching that is done incorrectly.
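The kind of incorrect caching being alluded to is, in broad strokes, the classic unsynchronized check-then-store. A hypothetical illustration (not the actual stdcxx code); run under a thread sanitizer with concurrent callers, this pattern reports a data race:

```cpp
#include <string>

struct facet_lazy_cache {
    mutable std::string grouping_cache;
    mutable bool cached = false;  // plain bool: neither atomic nor locked

    std::string grouping() const {
        if (!cached) {                       // unsynchronized read,
            grouping_cache = do_grouping();  // followed by unsynchronized
            cached = true;                   // writes: a race when called
        }                                    // from multiple threads
        return grouping_cache;
    }

    virtual std::string do_grouping() const { return "\3"; }
    virtual ~facet_lazy_cache() = default;
};
```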
I'm afraid unoptimized timings don't tell us much. Neither does
a comparison between two compilers, even on the same OS.
I looked at Liviu's timings today. I was puzzled by the difference
between (1) which, IIUC, is the current implementation (presumably
an optimized, thread-safe build with the same compiler and OS) and
(4), which, again IIUC, is the equivalent of your latest patch here
(again, presumably optimized, thread safe, same compiler/OS). I'm
having trouble envisioning how calling a virtual function to
retrieve the value of grouping can possibly be faster than not
calling it (and simply returning the value cached in the data
member of the facet).
The new results I attached to the issue come from somewhat clearer tests, and
they focus on just two cases: the current implementation vs. a non-caching one;
the latter just forwards the grouping calls to the protected do_grouping, with
_no_ other changes to the implementation.
The timing numbers seem to show that MT builds fare far worse with the caching
than without. Stefan, if you have the time, could you please infirm :) my
conclusions by timing it on one of your machines?
Thanks,
Liviu