Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
On Wed, 8 Dec 2021 at 19:21, Jonathan Wakely wrote: > > On Wed, 8 Dec 2021 at 19:17, Rainer Orth wrote: > > > > Hi Jonathan, > > > > > I've pushed this change to trunk now (it was posted and reviewed in > > > stage 1, I just didn't get around to pushing it until now). > > > > > > The final version of the patch is attached to this mail. > > > > unfortunately, it breaks Solaris/SPARC bootstrap: > > > > In file included from > > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr.h:53, > > from > > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/memory:77, > > from > > /vol/gcc/src/hg/master/local/libstdc++-v3/include/precompiled/stdc++.h:82: > > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h: > > In member function 'void std::_Sp_counted_base<_Lp>::_M_release() [with > > __gnu_cxx::_Lock_policy _Lp = __gnu_cxx::_S_atomic]': > > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:329:26: > > error: right operand of shift expression '(1 << 64)' is greater than or > > equal to the precision 64 of the left operand [-fpermissive] > > 329 | = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word))); > > | ~^ > > make[9]: *** [Makefile:1875: > > sparc-sun-solaris2.11/bits/stdc++.h.gch/O2ggnu++0x.gch] Error 1 > > > > For 64-bit SPARC, _Atomic_word is long. > > Ah yes, so we need to disable this optimization. Patch coming up ...

Gah, I remembered to check that:

  constexpr bool __double_word
    = sizeof(long long) == 2 * sizeof(_Atomic_word);
  // The ref-count members follow the vptr, so are aligned to
  // alignof(void*).
  constexpr bool __aligned = __alignof(long long) <= alignof(void*);
  if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)

But for C++11 and C++14 that is a normal runtime condition, not an if-constexpr, so the undefined shift still gets compiled, even though it can never be reached at runtime.
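One way to sidestep the problem on pre-C++17 dialects is to make the shift expression well-formed even in the dead branch. The sketch below (a hypothetical helper, not the actual libstdc++ fix) clamps the shift amount to zero whenever the double-word condition is false, so the expression never shifts by the full width of long long:

```cpp
#include <climits>

// Sketch: compute the "both counts == 1" bit pattern without ever
// forming an ill-formed shift.  When double_word is false the naive
// shift amount would equal the width of long long (ill-formed), so we
// clamp it to 0; the enclosing runtime check never takes that path.
template<typename AtomicWord>
constexpr long long unique_ref_value()
{
    constexpr bool double_word =
        sizeof(long long) == 2 * sizeof(AtomicWord);
    // Shift by 0 in the impossible case instead of by the full width.
    constexpr int shift =
        double_word ? CHAR_BIT * int(sizeof(AtomicWord)) : 0;
    return 1LL + (1LL << shift);
}
```

With a 4-byte counter type this yields the packed "use = 1, weak = 1" value; with a counter as wide as long long (the 64-bit SPARC situation above) it compiles cleanly and simply returns an unused value.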
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
On Wed, 8 Dec 2021 at 19:17, Rainer Orth wrote: > > Hi Jonathan, > > > I've pushed this change to trunk now (it was posted and reviewed in > > stage 1, I just didn't get around to pushing it until now). > > > > The final version of the patch is attached to this mail. > > unfortunately, it breaks Solaris/SPARC bootstrap: > > In file included from > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr.h:53, > from > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/memory:77, > from > /vol/gcc/src/hg/master/local/libstdc++-v3/include/precompiled/stdc++.h:82: > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h: > In member function 'void std::_Sp_counted_base<_Lp>::_M_release() [with > __gnu_cxx::_Lock_policy _Lp = __gnu_cxx::_S_atomic]': > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:329:26: > error: right operand of shift expression '(1 << 64)' is greater than or > equal to the precision 64 of the left operand [-fpermissive] > 329 | = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word))); > | ~^ > make[9]: *** [Makefile:1875: > sparc-sun-solaris2.11/bits/stdc++.h.gch/O2ggnu++0x.gch] Error 1 > > For 64-bit SPARC, _Atomic_word is long. Ah yes, so we need to disable this optimization. Patch coming up ...
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
Hi Jonathan, > I've pushed this change to trunk now (it was posted and reviewed in > stage 1, I just didn't get around to pushing it until now). > > The final version of the patch is attached to this mail. unfortunately, it breaks Solaris/SPARC bootstrap: In file included from /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr.h:53, from /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/memory:77, from /vol/gcc/src/hg/master/local/libstdc++-v3/include/precompiled/stdc++.h:82: /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h: In member function 'void std::_Sp_counted_base<_Lp>::_M_release() [with __gnu_cxx::_Lock_policy _Lp = __gnu_cxx::_S_atomic]': /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:329:26: error: right operand of shift expression '(1 << 64)' is greater than or equal to the precision 64 of the left operand [-fpermissive] 329 | = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word))); | ~^ make[9]: *** [Makefile:1875: sparc-sun-solaris2.11/bits/stdc++.h.gch/O2ggnu++0x.gch] Error 1 For 64-bit SPARC, _Atomic_word is long. Rainer -- - Rainer Orth, Center for Biotechnology, Bielefeld University
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
I've pushed this change to trunk now (it was posted and reviewed in stage 1, I just didn't get around to pushing it until now). The final version of the patch is attached to this mail. Thanks for the nice optimization, Maged! On Wed, 4 Aug 2021 at 20:49, Maged Michael via Libstdc++ wrote: > > On Wed, Aug 4, 2021 at 3:32 PM Jonathan Wakely > wrote: > > > On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote: > > > > > > Sorry. I totally missed the rest of your message and the patch. My fuzzy > > eyesight, which usually guesses correctly 90% of the time, mistook > > "Secondly" on a line by itself for "Sincerely" :-) > > > > :-) > > > > > The noinlining was based on looking at generated code. That was for > > clang. It was inlining the _M_last_use function for every instance of > > _M_release (e.g., ~shared_ptr). This optimization with the noinline for > > _M_release_last_use ended up reducing massive binary text sizes by 1.5% > > (several megabytes), which was an extra benefit. Without the noinline we > > saw code size increase. > > > > Wow, that is a convincing argument for making it not inline, thanks. > > > > > IIUC, we can use the following. Right? > > > > > > __attribute__((__noinline__)) > > > > Right. > > > > > I didn't understand the part about programmers doing #define noinline 1. > > I don't see code in the patch that uses noinline. > > > > This is a valid C++ program: > > > > #define noinline 1 > > #include <memory> > > int main() { } > > > > But if anything in <memory> uses "noinline" then this valid program > > will not compile. Which is why we must use ((__noinline__)) instead of > > ((noinline)). > > > > Thanks. Now I get it. > > > > How about something like this comment? > > > > // Noinline to avoid code size increase. > > > > Great, thanks. > > On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote: > > > Actually I take back what I said. Sorry. I think the logic in your patch > > is correct. I missed some of the atomic decrements.
> > > But I'd be concerned about performance. If we make _M_release_last_use > > noinline then we are adding overhead to the fast path of the original logic > > (where both counts are 1). > > > > Oh, I see. So the code duplication serves a purpose. We want the > > _M_release_last_use() code to be non-inline for the new logic, because > > in the common case we never call it (we either detect that both counts > > are 1 and do the dispose & destroy without atomic decrements, or we do > > a single decrement and don't dispose or destroy anything). But for the > > old logic, we do want that code inlined into _M_release (or > > _M_release_orig as it was called in your patch). Have I got that right > > now? > > > > Yes. Completely right. > > > > What if we remove the __noinline__ from _M_release_last_use() so that > > it can be inlined, but then add a noinline wrapper that can be called > > when we don't want to inline it? > > > > So: > > // Called by _M_release() when the use count reaches zero. > > void > > _M_release_last_use() noexcept > > { > > // unchanged from previous patch, but without the attribute. > > // ... > > } > > > > // As above, but 'noinline' to reduce code size on the cold path. > > __attribute__((__noinline__)) > > void > > _M_release_last_use_cold() noexcept > > { _M_release_last_use(); } > > > > > > And then: > > > > template<> > > inline void > > _Sp_counted_base<_S_atomic>::_M_release() noexcept > > { > > _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count); > > #if ! _GLIBCXX_TSAN > > constexpr bool __lock_free > > = __atomic_always_lock_free(sizeof(long long), 0) > > && __atomic_always_lock_free(sizeof(_Atomic_word), 0); > > constexpr bool __double_word > > = sizeof(long long) == 2 * sizeof(_Atomic_word); > > // The ref-count members follow the vptr, so are aligned to > > // alignof(void*).
> > constexpr bool __aligned = __alignof(long long) <= alignof(void*); > > if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned) > > { > > constexpr long long __unique_ref > > = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word))); > > auto __both_counts = reinterpret_cast<long long*>(&_M_use_count); > > > > _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count); > > if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) == > > __unique_ref) > > { > > // Both counts are 1, so there are no weak references and > > // we are releasing the last strong reference. No other > > // threads can observe the effects of this _M_release() > > // call (e.g. calling use_count()) without a data race. > > *(long long*)(&_M_use_count) = 0; > > _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count); > > _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
On Wed, Aug 4, 2021 at 3:32 PM Jonathan Wakely wrote: > On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote: > > > > Sorry. I totally missed the rest of your message and the patch. My fuzzy > eyesight, which usually guesses correctly 90% of the time, mistook > "Secondly" on a line by itself for "Sincerely" :-) > > :-) > > > The noinlining was based on looking at generated code. That was for > clang. It was inlining the _M_last_use function for every instance of > _M_release (e.g., ~shared_ptr). This optimization with the noinline for > _M_release_last_use ended up reducing massive binary text sizes by 1.5% > (several megabytes), which was an extra benefit. Without the noinline we > saw code size increase. > > Wow, that is a convincing argument for making it not inline, thanks. > > > IIUC, we can use the following. Right? > > > > __attribute__((__noinline__)) > > Right. > > > I didn't understand the part about programmers doing #define noinline 1. > I don't see code in the patch that uses noinline. > > This is a valid C++ program: > > #define noinline 1 > #include <memory> > int main() { } > > But if anything in <memory> uses "noinline" then this valid program > will not compile. Which is why we must use ((__noinline__)) instead of > ((noinline)). > > Thanks. Now I get it. > > > > > How about something like this comment? > > > > // Noinline to avoid code size increase. > > Great, thanks. > > On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote: > > Actually I take back what I said. Sorry. I think the logic in your patch > is correct. I missed some of the atomic decrements. > > But I'd be concerned about performance. If we make _M_release_last_use > noinline then we are adding overhead to the fast path of the original logic > (where both counts are 1). > > Oh, I see. So the code duplication serves a purpose.
We want the > _M_release_last_use() code to be non-inline for the new logic, because > in the common case we never call it (we either detect that both counts > are 1 and do the dispose & destroy without atomic decrements, or we do > a single decrement and don't dispose or destroy anything). But for the > old logic, we do want that code inlined into _M_release (or > _M_release_orig as it was called in your patch). Have I got that right > now? > > Yes. Completely right. > What if we remove the __noinline__ from _M_release_last_use() so that > it can be inlined, but then add a noinline wrapper that can be called > when we don't want to inline it? > > So: > // Called by _M_release() when the use count reaches zero. > void > _M_release_last_use() noexcept > { > // unchanged from previous patch, but without the attribute. > // ... > } > > // As above, but 'noinline' to reduce code size on the cold path. > __attribute__((__noinline__)) > void > _M_release_last_use_cold() noexcept > { _M_release_last_use(); } > > > And then: > > template<> > inline void > _Sp_counted_base<_S_atomic>::_M_release() noexcept > { > _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count); > #if ! _GLIBCXX_TSAN > constexpr bool __lock_free > = __atomic_always_lock_free(sizeof(long long), 0) > && __atomic_always_lock_free(sizeof(_Atomic_word), 0); > constexpr bool __double_word > = sizeof(long long) == 2 * sizeof(_Atomic_word); > // The ref-count members follow the vptr, so are aligned to > // alignof(void*).
> constexpr bool __aligned = __alignof(long long) <= alignof(void*); > if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned) > { > constexpr long long __unique_ref > = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word))); > auto __both_counts = reinterpret_cast<long long*>(&_M_use_count); > > _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count); > if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) == > __unique_ref) > { > // Both counts are 1, so there are no weak references and > // we are releasing the last strong reference. No other > // threads can observe the effects of this _M_release() > // call (e.g. calling use_count()) without a data race. > *(long long*)(&_M_use_count) = 0; > _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count); > _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count); > _M_dispose(); > _M_destroy(); > return; > } > if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == > 1) > { > _M_release_last_use_cold(); > return; > } > } > else > #endif > if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1) > { > _M_release_last_use(); > } > } > > > So we use the noinline version for the else branch in the new logic
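The hot/cold split proposed in the quoted text can be sketched in isolation (hypothetical names, not the libstdc++ members): the real work stays in an inlinable function, and a noinline forwarder is offered for call sites that should stay small.

```cpp
// Toy model of the proposed split: release_last_use() is inlinable for
// the path where inlining pays off; release_last_use_cold() has the
// same behaviour but is kept out of line to shrink hot call sites.
struct RefCounted {
    bool disposed = false;

    void dispose() { disposed = true; }  // stand-in for destroying the object

    // Can be inlined where profitable (the "old logic" path).
    void release_last_use() noexcept { dispose(); }

    // Same behaviour, out of line, for the cold branch of the new logic.
    __attribute__((__noinline__))
    void release_last_use_cold() noexcept { release_last_use(); }
};
```

Both entry points do identical work; only the inlining decision differs, which is exactly why the wrapper costs nothing on the path that calls the inlinable version directly.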
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote: > > Sorry. I totally missed the rest of your message and the patch. My fuzzy > eyesight, which usually guesses correctly 90% of the time, mistook "Secondly" > on a line by itself for "Sincerely" :-) :-) > The noinlining was based on looking at generated code. That was for clang. It > was inlining the _M_last_use function for every instance of _M_release (e.g., > ~shared_ptr). This optimization with the noinline for _M_release_last_use > ended up reducing massive binary text sizes by 1.5% (several megabytes), > which was an extra benefit. Without the noinline we saw code size increase. Wow, that is a convincing argument for making it not inline, thanks. > IIUC, we can use the following. Right? > > __attribute__((__noinline__)) Right. > I didn't understand the part about programmers doing #define noinline 1. I > don't see code in the patch that uses noinline. This is a valid C++ program: #define noinline 1 #include <memory> int main() { } But if anything in <memory> uses "noinline" then this valid program will not compile. Which is why we must use ((__noinline__)) instead of ((noinline)). > > How about something like this comment? > > // Noinline to avoid code size increase. Great, thanks. On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote: > Actually I take back what I said. Sorry. I think the logic in your patch is > correct. I missed some of the atomic decrements. > But I'd be concerned about performance. If we make _M_release_last_use > noinline then we are adding overhead to the fast path of the original logic > (where both counts are 1). Oh, I see. So the code duplication serves a purpose. We want the _M_release_last_use() code to be non-inline for the new logic, because in the common case we never call it (we either detect that both counts are 1 and do the dispose & destroy without atomic decrements, or we do a single decrement and don't dispose or destroy anything).
But for the old logic, we do want that code inlined into _M_release (or _M_release_orig as it was called in your patch). Have I got that right now?

What if we remove the __noinline__ from _M_release_last_use() so that it can be inlined, but then add a noinline wrapper that can be called when we don't want to inline it?

So:

  // Called by _M_release() when the use count reaches zero.
  void
  _M_release_last_use() noexcept
  {
    // unchanged from previous patch, but without the attribute.
    // ...
  }

  // As above, but 'noinline' to reduce code size on the cold path.
  __attribute__((__noinline__))
  void
  _M_release_last_use_cold() noexcept
  { _M_release_last_use(); }

And then:

  template<>
    inline void
    _Sp_counted_base<_S_atomic>::_M_release() noexcept
    {
      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
#if ! _GLIBCXX_TSAN
      constexpr bool __lock_free
        = __atomic_always_lock_free(sizeof(long long), 0)
        && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
      constexpr bool __double_word
        = sizeof(long long) == 2 * sizeof(_Atomic_word);
      // The ref-count members follow the vptr, so are aligned to
      // alignof(void*).
      constexpr bool __aligned = __alignof(long long) <= alignof(void*);
      if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
        {
          constexpr long long __unique_ref
            = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
          auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);

          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
          if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) == __unique_ref)
            {
              // Both counts are 1, so there are no weak references and
              // we are releasing the last strong reference. No other
              // threads can observe the effects of this _M_release()
              // call (e.g. calling use_count()) without a data race.
              *(long long*)(&_M_use_count) = 0;
              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
              _M_dispose();
              _M_destroy();
              return;
            }
          if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
            {
              _M_release_last_use_cold();
              return;
            }
        }
      else
#endif
      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
        {
          _M_release_last_use();
        }
    }

So we use the noinline version for the else branch in the new logic, but the can-inline version for the old logic. Would that work?

We could also consider adding __attribute__((__cold__)) to the _M_release_last_use_cold() function, and/or add [[__unlikely__]] to the 'if' that calls it, but maybe that's overkill.

It seems like this will slightly pessimize the case where the last use count
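The fast path above can be modelled with a self-contained toy (an illustration only, using a single packed 64-bit atomic rather than the real adjacent _Atomic_word pair): one acquire load detects the "both counts are 1" state, and only the fallback pays for an atomic read-modify-write.

```cpp
#include <atomic>
#include <cstdint>

// Toy model of the combined-load optimization: use_count in the low
// 32 bits, weak_count in the high 32 bits of one 64-bit atomic.
struct PackedCounts {
    std::atomic<std::uint64_t> both{(1ULL << 32) | 1ULL};  // use = 1, weak = 1

    // Returns true if the unique-reference fast path was taken.
    bool release_fast_path() {
        const std::uint64_t unique = (1ULL << 32) | 1ULL;
        if (both.load(std::memory_order_acquire) == unique) {
            // No other thread can observe the counts any more, so a
            // relaxed store stands in for the plain store in the patch.
            both.store(0, std::memory_order_relaxed);
            return true;   // caller would dispose + destroy here
        }
        return false;      // fall back to the atomic-decrement path
    }
};
```

The first release on a freshly constructed object takes the fast path; any later call sees a non-unique value and falls through to the slow path, mirroring the structure of the patch.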
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
On Wed, Aug 4, 2021 at 1:19 PM Maged Michael wrote: > Sorry. I totally missed the rest of your message and the patch. My fuzzy > eyesight, which usually guesses correctly 90% of the time, mistook > "Secondly" on a line by itself for "Sincerely" :-) > > On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely > wrote: > >> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote: >> > >> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote: >> > > >> > > This is the right patch. The previous one is missing noexcept. Sorry. >> > > >> > > >> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael >> wrote: >> > >> >> > >> Please find attached an updated patch after incorporating Jonathan's >> suggestions. >> > >> >> > >> Changes from the last patch include: >> > >> - Add a TSAN macro to bits/c++config. >> > >> - Use separate constexpr bool-s for the conditions for lock-freedom, >> double-width and alignment. >> > >> - Move the code in the optimized path to a separate function >> _M_release_double_width_cas. >> > >> > Thanks for the updated patch. At a quick glance it looks great. I'll >> > apply it locally and test it tomorrow. >> >> >> It occurs to me now that _M_release_double_width_cas is the wrong >> name, because we're not doing a DWCAS, just a double-width load. What >> do you think about renaming it to _M_release_combined instead? Since >> it does a combined load of the two counts. >> >> I'm no longer confident in the alignof suggestion I gave you. >> >> +constexpr bool __double_word >> + = sizeof(long long) == 2 * sizeof(_Atomic_word); >> +// The ref-count members follow the vptr, so are aligned to >> +// alignof(void*). >> +constexpr bool __aligned = alignof(long long) <= alignof(void*); >> >> For IA32 (i.e. 32-bit x86) this constant will be true, because >> alignof(long long) and alignof(void*) are both 4, even though >> sizeof(long long) is 8. So in theory the _M_use_count and >> _M_weak_count members could be in different cache lines, and the >> atomic load will not be atomic. 
I think we want to use __alignof(long >> long) here to get the "natural" alignment, not the smaller 4B >> alignment allowed by SysV ABI. That means that we won't do this >> optimization on IA32, but I think that's acceptable. >> >> Alternatively, we could keep using alignof, but with an additional >> run-time check something like (uintptr_t)&_M_use_count / 64 == >> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache >> line. I think for now we should just use __alignof and live without >> the optimization on IA32. >> >> Secondly: >> >> + void >> + __attribute__ ((noinline)) >> >> This needs to be __noinline__ because noinline is not a reserved word, >> so users can do: >> #define noinline 1 >> #include <memory> >> >> Was it performance profiling, or code-size measurements (or something >> else?) that showed this should be non-inline? >> I'd like to add a comment explaining why we want it to be noinline. >> >> The noinlining was based on looking at generated code. That was for > clang. It was inlining the _M_last_use function for every instance of > _M_release (e.g., ~shared_ptr). This optimization with the noinline for > _M_release_last_use ended up reducing massive binary text sizes by 1.5% > (several megabytes), which was an extra benefit. Without the noinline we > saw code size increase. > > IIUC, we can use the following. Right? > > __attribute__((__noinline__)) > > > I didn't understand the part about programmers doing #define noinline 1. > I don't see code in the patch that uses noinline. > > How about something like this comment? > > // Noinline to avoid code size increase. > > > > In that function ... >> >> + _M_release_last_use() noexcept >> + { >> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count); >> +_M_dispose(); >> +if (_Mutex_base<_Lp>::_S_need_barriers) >> + { >> +__atomic_thread_fence (__ATOMIC_ACQ_REL); >> + } >> >> I think we can remove this fence.
The _S_need_barriers constant is >> only true for the _S_mutex policy, and that just calls >> _M_release_orig(), so never uses this function. I'll remove it and add >> a comment noting that the lack of barrier is intentional. >> >> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count); >> +if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, >> + -1) == 1) >> + { >> +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count); >> +_M_destroy(); >> + } >> + } >> >> Alternatively, we could keep the fence in _M_release_last_use() and >> refactor the other _M_release functions, so that we have explicit >> specializations for each of: >> _Sp_counted_base<_S_single>::_M_release() (already present) >> _Sp_counted_base<_S_mutex>::_M_release() >> _Sp_counted_base<_S_atomic>::_M_release() >> >> The second and third would be new, as currently they both use the definition in the primary template.
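The reserved-name issue described in the quoted text is easy to reproduce. A short sketch (hypothetical function name) of why a standard header must spell the attribute __noinline__:

```cpp
// A user may legally define a macro named 'noinline' before including
// any standard header, because 'noinline' is not a reserved identifier.
#define noinline 1

// If a header wrote __attribute__((noinline)), it would expand to
// __attribute__((1)) here and fail to compile.  The double-underscore
// spelling is immune to user macros:
__attribute__((__noinline__))
int slow_path(int x) { return x * 2; }
```

The same reasoning applies to every attribute and keyword-like token a library header emits: only the `__name__` spellings are reserved to the implementation.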
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
Sorry. I totally missed the rest of your message and the patch. My fuzzy eyesight, which usually guesses correctly 90% of the time, mistook "Secondly" on a line by itself for "Sincerely" :-) On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely wrote: > On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote: > > > > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote: > > > > > > This is the right patch. The previous one is missing noexcept. Sorry. > > > > > > > > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael > wrote: > > >> > > >> Please find attached an updated patch after incorporating Jonathan's > suggestions. > > >> > > >> Changes from the last patch include: > > >> - Add a TSAN macro to bits/c++config. > > >> - Use separate constexpr bool-s for the conditions for lock-freedom, > double-width and alignment. > > >> - Move the code in the optimized path to a separate function > _M_release_double_width_cas. > > > > Thanks for the updated patch. At a quick glance it looks great. I'll > > apply it locally and test it tomorrow. > > > It occurs to me now that _M_release_double_width_cas is the wrong > name, because we're not doing a DWCAS, just a double-width load. What > do you think about renaming it to _M_release_combined instead? Since > it does a combined load of the two counts. > > I'm no longer confident in the alignof suggestion I gave you. > > +constexpr bool __double_word > + = sizeof(long long) == 2 * sizeof(_Atomic_word); > +// The ref-count members follow the vptr, so are aligned to > +// alignof(void*). > +constexpr bool __aligned = alignof(long long) <= alignof(void*); > > For IA32 (i.e. 32-bit x86) this constant will be true, because > alignof(long long) and alignof(void*) are both 4, even though > sizeof(long long) is 8. So in theory the _M_use_count and > _M_weak_count members could be in different cache lines, and the > atomic load will not be atomic. 
I think we want to use __alignof(long > long) here to get the "natural" alignment, not the smaller 4B > alignment allowed by SysV ABI. That means that we won't do this > optimization on IA32, but I think that's acceptable. > > Alternatively, we could keep using alignof, but with an additional > run-time check something like (uintptr_t)&_M_use_count / 64 == > (uintptr_t)&_M_weak_count / 64 to check they're on the same cache > line. I think for now we should just use __alignof and live without > the optimization on IA32. > > Secondly: > > + void > + __attribute__ ((noinline)) > > This needs to be __noinline__ because noinline is not a reserved word, > so users can do: > #define noinline 1 > #include <memory> > > Was it performance profiling, or code-size measurements (or something > else?) that showed this should be non-inline? > I'd like to add a comment explaining why we want it to be noinline. > > The noinlining was based on looking at generated code. That was for clang. It was inlining the _M_last_use function for every instance of _M_release (e.g., ~shared_ptr). This optimization with the noinline for _M_release_last_use ended up reducing massive binary text sizes by 1.5% (several megabytes), which was an extra benefit. Without the noinline we saw code size increase. IIUC, we can use the following. Right? __attribute__((__noinline__)) I didn't understand the part about programmers doing #define noinline 1. I don't see code in the patch that uses noinline. How about something like this comment? // Noinline to avoid code size increase. In that function ... > > + _M_release_last_use() noexcept > + { > +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count); > +_M_dispose(); > +if (_Mutex_base<_Lp>::_S_need_barriers) > + { > +__atomic_thread_fence (__ATOMIC_ACQ_REL); > + } > > I think we can remove this fence. The _S_need_barriers constant is > only true for the _S_mutex policy, and that just calls > _M_release_orig(), so never uses this function.
I'll remove it and add > a comment noting that the lack of barrier is intentional. > > +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count); > +if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, > + -1) == 1) > + { > +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count); > +_M_destroy(); > + } > + } > > Alternatively, we could keep the fence in _M_release_last_use() and > refactor the other _M_release functions, so that we have explicit > specializations for each of: > _Sp_counted_base<_S_single>::_M_release() (already present) > _Sp_counted_base<_S_mutex>::_M_release() > _Sp_counted_base<_S_atomic>::_M_release() > > The second and third would be new, as currently they both use the > definition in the primary template. The _S_mutex one would just > decrement _M_use_count then call _M_release_last_use() (which still > has the barrier needed for the _S_mutex policy).
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
On Wed, 4 Aug 2021 at 16:47, Maged Michael wrote: > > Thanks, Jonathan! > > On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely wrote: >> >> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote: >> > >> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote: >> > > >> > > This is the right patch. The previous one is missing noexcept. Sorry. >> > > >> > > >> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael >> > > wrote: >> > >> >> > >> Please find attached an updated patch after incorporating Jonathan's >> > >> suggestions. >> > >> >> > >> Changes from the last patch include: >> > >> - Add a TSAN macro to bits/c++config. >> > >> - Use separate constexpr bool-s for the conditions for lock-freedom, >> > >> double-width and alignment. >> > >> - Move the code in the optimized path to a separate function >> > >> _M_release_double_width_cas. >> > >> > Thanks for the updated patch. At a quick glance it looks great. I'll >> > apply it locally and test it tomorrow. >> >> >> It occurs to me now that _M_release_double_width_cas is the wrong >> name, because we're not doing a DWCAS, just a double-width load. What >> do you think about renaming it to _M_release_combined instead? Since >> it does a combined load of the two counts. > > > Yes definitely _M_release_combined makes sense. Will do. See the patch I attached to my previous mail, which has a refactored version that gets rid of that function entirely. > >> >> I'm no longer confident in the alignof suggestion I gave you. >> >> +constexpr bool __double_word >> + = sizeof(long long) == 2 * sizeof(_Atomic_word); >> +// The ref-count members follow the vptr, so are aligned to >> +// alignof(void*). >> +constexpr bool __aligned = alignof(long long) <= alignof(void*); >> >> For IA32 (i.e. 32-bit x86) this constant will be true, because >> alignof(long long) and alignof(void*) are both 4, even though >> sizeof(long long) is 8. 
So in theory the _M_use_count and >> _M_weak_count members could be in different cache lines, and the >> atomic load will not be atomic. I think we want to use __alignof(long >> long) here to get the "natural" alignment, not the smaller 4B >> alignment allowed by SysV ABI. That means that we won't do this >> optimization on IA32, but I think that's acceptable. >> >> Alternatively, we could keep using alignof, but with an additional >> run-time check something like (uintptr_t)&_M_use_count / 64 == >> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache >> line. I think for now we should just use __alignof and live without >> the optimization on IA32. >> > I'd rather avoid any runtime checks because they may negate the speed > rationale for doing this optimization. > I'd be OK with not doing this optimization for any 32-bit architecture. > > Is it OK to change the __aligned condition to the following? > constexpr bool __aligned = > (alignof(long long) <= alignof(void*)) > && (sizeof(long long) == sizeof(void*)); Yes, that will work fine.
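The run-time check floated above (and ultimately rejected in favour of __alignof) would look something like the standalone helper below, assuming a 64-byte cache line:

```cpp
#include <cstdint>

// True when two addresses fall in the same (assumed 64-byte) cache
// line, i.e. a naturally-unaligned 8-byte load spanning both members
// would still touch only one line.
inline bool same_cache_line(const void* a, const void* b)
{
    return (reinterpret_cast<std::uintptr_t>(a) / 64)
        == (reinterpret_cast<std::uintptr_t>(b) / 64);
}
```

As the thread notes, a check like this runs on every release, which is precisely the overhead the optimization is trying to remove; hence the preference for a compile-time __alignof test instead.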
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
Thanks, Jonathan! On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely wrote: > On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote: > > > > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote: > > > > > > This is the right patch. The previous one is missing noexcept. Sorry. > > > > > > > > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael > wrote: > > >> > > >> Please find attached an updated patch after incorporating Jonathan's > suggestions. > > >> > > >> Changes from the last patch include: > > >> - Add a TSAN macro to bits/c++config. > > >> - Use separate constexpr bool-s for the conditions for lock-freedom, > double-width and alignment. > > >> - Move the code in the optimized path to a separate function > _M_release_double_width_cas. > > > > Thanks for the updated patch. At a quick glance it looks great. I'll > > apply it locally and test it tomorrow. > > > It occurs to me now that _M_release_double_width_cas is the wrong > name, because we're not doing a DWCAS, just a double-width load. What > do you think about renaming it to _M_release_combined instead? Since > it does a combined load of the two counts. > Yes definitely _M_release_combined makes sense. Will do. > I'm no longer confident in the alignof suggestion I gave you. > > +constexpr bool __double_word > + = sizeof(long long) == 2 * sizeof(_Atomic_word); > +// The ref-count members follow the vptr, so are aligned to > +// alignof(void*). > +constexpr bool __aligned = alignof(long long) <= alignof(void*); > > For IA32 (i.e. 32-bit x86) this constant will be true, because > alignof(long long) and alignof(void*) are both 4, even though > sizeof(long long) is 8. So in theory the _M_use_count and > _M_weak_count members could be in different cache lines, and the > atomic load will not be atomic. I think we want to use __alignof(long > long) here to get the "natural" alignment, not the smaller 4B > alignment allowed by SysV ABI. That means that we won't do this > optimization on IA32, but I think that's acceptable. 
> > Alternatively, we could keep using alignof, but with an additional > run-time check something like (uintptr_t)&_M_use_count / 64 == > (uintptr_t)&_M_weak_count / 64 to check they're on the same cache > line. I think for now we should just use __alignof and live without > the optimization on IA32. > > I'd rather avoid any runtime checks because they may negate the speed rationale for doing this optimization. I'd be OK with not doing this optimization for any 32-bit architecture. Is it OK to change the __align condition to the following? constexpr bool __aligned = (alignof(long long) <= alignof(void*)) && (sizeof(long long) == sizeof(void*)); Thanks, Maged > Secondly: > > + void > + __attribute__ ((noinline)) > > This needs to be __noinline__ because noinline is not a reserved word, > so users can do: > #define noinline 1 > #include > > Was it performance profiling, or code-size measurements (or something > else?) that showed this should be non-inline? > I'd like to add a comment explaining why we want it to be noinline. > > In that function ... > > + _M_release_last_use() noexcept > + { > +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count); > +_M_dispose(); > +if (_Mutex_base<_Lp>::_S_need_barriers) > + { > +__atomic_thread_fence (__ATOMIC_ACQ_REL); > + } > > I think we can remove this fence. The _S_need_barriers constant is > only true for the _S_mutex policy, and that just calls > _M_release_orig(), so never uses this function. I'll remove it and add > a comment noting that the lack of barrier is intentional. 
> > +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count); > +if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, > + -1) == 1) > + { > +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count); > +_M_destroy(); > + } > + } > > Alternatively, we could keep the fence in _M_release_last_use() and > refactor the other _M_release functions, so that we have explicit > specializations for each of: > _Sp_counted_base<_S_single>::_M_release() (already present) > _Sp_counted_base<_S_mutex>::_M_release() > _Sp_counted_base<_S_atomic>::_M_release() > > The second and third would be new, as currently they both use the > definition in the primary template. The _S_mutex one would just > decrement _M_use_count then call _M_release_last_use() (which still > has the barrier needed for the _S_mutex policy). The _S_atomic one > would have your new optimization. See the attached patch showing what > I mean. I find this version a bit simpler to understand, as we just > have _M_release and _M_release_last_use, without > _M_release_double_width_cas and _M_release_orig. What do you think of > this version? Does it lose any important properties of your version > which I've failed to notice? >
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote: > > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote: > > > > This is the right patch. The previous one is missing noexcept. Sorry. > > > > > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael > > wrote: > >> > >> Please find attached an updated patch after incorporating Jonathan's > >> suggestions. > >> > >> Changes from the last patch include: > >> - Add a TSAN macro to bits/c++config. > >> - Use separate constexpr bool-s for the conditions for lock-freedom, > >> double-width and alignment. > >> - Move the code in the optimized path to a separate function > >> _M_release_double_width_cas. > > Thanks for the updated patch. At a quick glance it looks great. I'll > apply it locally and test it tomorrow. It occurs to me now that _M_release_double_width_cas is the wrong name, because we're not doing a DWCAS, just a double-width load. What do you think about renaming it to _M_release_combined instead? Since it does a combined load of the two counts. I'm no longer confident in the alignof suggestion I gave you. +constexpr bool __double_word + = sizeof(long long) == 2 * sizeof(_Atomic_word); +// The ref-count members follow the vptr, so are aligned to +// alignof(void*). +constexpr bool __aligned = alignof(long long) <= alignof(void*); For IA32 (i.e. 32-bit x86) this constant will be true, because alignof(long long) and alignof(void*) are both 4, even though sizeof(long long) is 8. So in theory the _M_use_count and _M_weak_count members could be in different cache lines, and the atomic load will not be atomic. I think we want to use __alignof(long long) here to get the "natural" alignment, not the smaller 4B alignment allowed by SysV ABI. That means that we won't do this optimization on IA32, but I think that's acceptable. Alternatively, we could keep using alignof, but with an additional run-time check something like (uintptr_t)&_M_use_count / 64 == (uintptr_t)&_M_weak_count / 64 to check they're on the same cache line. 
I think for now we should just use __alignof and live without the optimization on IA32. Secondly: + void + __attribute__ ((noinline)) This needs to be __noinline__ because noinline is not a reserved word, so users can do: #define noinline 1 #include Was it performance profiling, or code-size measurements (or something else?) that showed this should be non-inline? I'd like to add a comment explaining why we want it to be noinline. In that function ... + _M_release_last_use() noexcept + { +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count); +_M_dispose(); +if (_Mutex_base<_Lp>::_S_need_barriers) + { +__atomic_thread_fence (__ATOMIC_ACQ_REL); + } I think we can remove this fence. The _S_need_barriers constant is only true for the _S_mutex policy, and that just calls _M_release_orig(), so never uses this function. I'll remove it and add a comment noting that the lack of barrier is intentional. +_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count); +if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, + -1) == 1) + { +_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count); +_M_destroy(); + } + } Alternatively, we could keep the fence in _M_release_last_use() and refactor the other _M_release functions, so that we have explicit specializations for each of: _Sp_counted_base<_S_single>::_M_release() (already present) _Sp_counted_base<_S_mutex>::_M_release() _Sp_counted_base<_S_atomic>::_M_release() The second and third would be new, as currently they both use the definition in the primary template. The _S_mutex one would just decrement _M_use_count then call _M_release_last_use() (which still has the barrier needed for the _S_mutex policy). The _S_atomic one would have your new optimization. See the attached patch showing what I mean. I find this version a bit simpler to understand, as we just have _M_release and _M_release_last_use, without _M_release_double_width_cas and _M_release_orig. What do you think of this version? 
Does it lose any important properties of your version which I've failed to notice? diff --git a/libstdc++-v3/include/bits/c++config b/libstdc++-v3/include/bits/c++config index 32b8957f814..07465f7ecd5 100644 --- a/libstdc++-v3/include/bits/c++config +++ b/libstdc++-v3/include/bits/c++config @@ -143,6 +143,15 @@ # define _GLIBCXX_NODISCARD #endif +// Macro for TSAN. +#if __SANITIZE_THREAD__ +# define _GLIBCXX_TSAN 1 +#elif defined __has_feature +# if __has_feature(thread_sanitizer) +# define _GLIBCXX_TSAN 1 +# endif +#endif + #if __cplusplus diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h index 5be935d174d..b2397c8fddb 100644 --- a/libstdc++-v3/include/bits/shared_ptr_base.h +++ b/libstdc++-v3/include/bits/shared_
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote: > > This is the right patch. The previous one is missing noexcept. Sorry. > > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael wrote: >> >> Please find attached an updated patch after incorporating Jonathan's >> suggestions. >> >> Changes from the last patch include: >> - Add a TSAN macro to bits/c++config. >> - Use separate constexpr bool-s for the conditions for lock-freedom, >> double-width and alignment. >> - Move the code in the optimized path to a separate function >> _M_release_double_width_cas. Thanks for the updated patch. At a quick glance it looks great. I'll apply it locally and test it tomorrow.
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
This is the right patch. The previous one is missing noexcept. Sorry. On Mon, Aug 2, 2021 at 9:23 AM Maged Michael wrote: > Please find attached an updated patch after incorporating Jonathan's > suggestions. > > Changes from the last patch include: > - Add a TSAN macro to bits/c++config. > - Use separate constexpr bool-s for the conditions for lock-freedom, > double-width and alignment. > - Move the code in the optimized path to a separate function > _M_release_double_width_cas. > > Thanks, > Maged > > > On Fri, Jul 16, 2021 at 11:55 AM Maged Michael > wrote: > >> Thank you, Jonathan, for the detailed comments! I'll update the patch >> accordingly. >> >> On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely >> wrote: >> >>> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote: >>> > >>> > Please find a proposed patch for _Sp_counted_base::_M_release to skip >>> the >>> > two atomic instructions that decrement each of the use count and the >>> weak >>> > count when both are 1. I proposed the general idea in an earlier >>> thread ( >>> > https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) >>> and got >>> > useful feedback on a draft patch and responses to related questions >>> about >>> > multi-granular atomicity and alignment. This patch is based on that >>> > feedback. >>> > >>> > >>> > I added a check for thread sanitizer to use the current algorithm in >>> that >>> > case because TSAN does not support multi-granular atomicity. I'd like >>> to >>> > add a check of __has_feature(thread_sanitizer) for building using >>> LLVM. I >>> > found examples of __has_feature in libstdc++ >>> >>> There are no uses of __has_feature in libstdc++. We do use >>> __has_builtin (which GCC also supports) and Clang's __is_identifier >>> (which GCC doesn't support) to work around some weird semantics of >>> __has_builtin in older versions of Clang. >>> >>> >>> > but it doesn't seem to be >>> > recognized in shared_ptr_base.h. 
Any guidance on how to check >>> > __has_feature(thread_sanitizer) in this patch? >>> >>> I think we want to do something like this in include/bits/c++config >>> >>> #if __SANITIZE_THREAD__ >>> # define _GLIBCXX_TSAN 1 >>> #elif defined __has_feature >>> # if __has_feature(thread_sanitizer) >>> # define _GLIBCXX_TSAN 1 >>> # endif >>> #endif >>> >>> Then in bits/shared_ptr_base.h >>> >>> #if _GLIBCXX_TSAN >>> _M_release_orig(); >>> return; >>> #endif >>> >>> >>> >>> > GCC generates code for _M_release that is larger and more complex than >>> that >>> > generated by LLVM. I'd like to file a bug report about that. Jonathan, >>> >>> Is this the same issue as https://gcc.gnu.org/PR101406 ? >>> >>> Partly yes. Even when using __atomic_add_dispatch I noticed that clang >> generated less code than gcc. I see in the response to the issue that the >> new glibc is expected to optimize better. So maybe this will eliminate the >> issue. >> >> >>> > would you please create a bugzilla account for me ( >>> > https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you. >>> >>> Done (sorry, I didn't notice the request in this mail until coming >>> back to it to review the patch properly). >>> >>> Thank you! >> >> >>> >>> >>> > >>> > >>> > Information about the patch: >>> > >>> > - Benefits of the patch: Save the cost of the last atomic decrements of >>> > each of the use count and the weak count in _Sp_counted_base. Atomic >>> > instructions are significantly slower than regular loads and stores >>> across >>> > major architectures. >>> > >>> > - How current code works: _M_release() atomically decrements the use >>> count, >>> > checks if it was 1, if so calls _M_dispose(), atomically decrements the >>> > weak count, checks if it was 1, and if so calls _M_destroy(). 
>>> > >>> > - How the proposed patch works: _M_release() loads both use count and >>> weak >>> > count together atomically (when properly aligned), checks if the value >>> is >>> > equal to the value of both counts equal to 1 (e.g., 0x10001), and >>> if so >>> > calls _M_dispose() and _M_destroy(). Otherwise, it follows the original >>> > algorithm. >>> > >>> > - Why it works: When the current thread executing _M_release() finds >>> each >>> > of the counts is equal to 1, then (when _lock_policy is _S_atomic) no >>> other >>> > threads could possibly hold use or weak references to this control >>> block. >>> > That is, no other threads could possibly access the counts or the >>> protected >>> > object. >>> > >>> > - The proposed patch is intended to interact correctly with current >>> code >>> > (under certain conditions: _Lock_policy is _S_atomic, proper >>> alignment, and >>> > native lock-free support for atomic operations). That is, multiple >>> threads >>> > using different versions of the code with and without the patch >>> operating >>> > on the same objects should always interact correctly. The intent for >>> the >>> > patch is to be ABI compatible with the current implementation. >>> > >>> > - The proposed patch
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
Please find attached an updated patch after incorporating Jonathan's suggestions. Changes from the last patch include: - Add a TSAN macro to bits/c++config. - Use separate constexpr bool-s for the conditions for lock-freedom, double-width and alignment. - Move the code in the optimized path to a separate function _M_release_double_width_cas. Thanks, Maged On Fri, Jul 16, 2021 at 11:55 AM Maged Michael wrote: > Thank you, Jonathan, for the detailed comments! I'll update the patch > accordingly. > > On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely > wrote: > >> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote: >> > >> > Please find a proposed patch for _Sp_counted_base::_M_release to skip >> the >> > two atomic instructions that decrement each of the use count and the >> weak >> > count when both are 1. I proposed the general idea in an earlier thread >> ( >> > https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and >> got >> > useful feedback on a draft patch and responses to related questions >> about >> > multi-granular atomicity and alignment. This patch is based on that >> > feedback. >> > >> > >> > I added a check for thread sanitizer to use the current algorithm in >> that >> > case because TSAN does not support multi-granular atomicity. I'd like to >> > add a check of __has_feature(thread_sanitizer) for building using LLVM. >> I >> > found examples of __has_feature in libstdc++ >> >> There are no uses of __has_feature in libstdc++. We do use >> __has_builtin (which GCC also supports) and Clang's __is_identifier >> (which GCC doesn't support) to work around some weird semantics of >> __has_builtin in older versions of Clang. >> >> >> > but it doesn't seem to be >> > recognized in shared_ptr_base.h. Any guidance on how to check >> > __has_feature(thread_sanitizer) in this patch? 
>> >> I think we want to do something like this in include/bits/c++config >> >> #if __SANITIZE_THREAD__ >> # define _GLIBCXX_TSAN 1 >> #elif defined __has_feature >> # if __has_feature(thread_sanitizer) >> # define _GLIBCXX_TSAN 1 >> # endif >> #endif >> >> Then in bits/shared_ptr_base.h >> >> #if _GLIBCXX_TSAN >> _M_release_orig(); >> return; >> #endif >> >> >> >> > GCC generates code for _M_release that is larger and more complex than >> that >> > generated by LLVM. I'd like to file a bug report about that. Jonathan, >> >> Is this the same issue as https://gcc.gnu.org/PR101406 ? >> >> Partly yes. Even when using __atomic_add_dispatch I noticed that clang > generated less code than gcc. I see in the response to the issue that the > new glibc is expected to optimize better. So maybe this will eliminate the > issue. > > >> > would you please create a bugzilla account for me ( >> > https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you. >> >> Done (sorry, I didn't notice the request in this mail until coming >> back to it to review the patch properly). >> >> Thank you! > > >> >> >> > >> > >> > Information about the patch: >> > >> > - Benefits of the patch: Save the cost of the last atomic decrements of >> > each of the use count and the weak count in _Sp_counted_base. Atomic >> > instructions are significantly slower than regular loads and stores >> across >> > major architectures. >> > >> > - How current code works: _M_release() atomically decrements the use >> count, >> > checks if it was 1, if so calls _M_dispose(), atomically decrements the >> > weak count, checks if it was 1, and if so calls _M_destroy(). >> > >> > - How the proposed patch works: _M_release() loads both use count and >> weak >> > count together atomically (when properly aligned), checks if the value >> is >> > equal to the value of both counts equal to 1 (e.g., 0x10001), and >> if so >> > calls _M_dispose() and _M_destroy(). Otherwise, it follows the original >> > algorithm. 
>> > >> > - Why it works: When the current thread executing _M_release() finds >> each >> > of the counts is equal to 1, then (when _lock_policy is _S_atomic) no >> other >> > threads could possibly hold use or weak references to this control >> block. >> > That is, no other threads could possibly access the counts or the >> protected >> > object. >> > >> > - The proposed patch is intended to interact correctly with current code >> > (under certain conditions: _Lock_policy is _S_atomic, proper alignment, >> and >> > native lock-free support for atomic operations). That is, multiple >> threads >> > using different versions of the code with and without the patch >> operating >> > on the same objects should always interact correctly. The intent for the >> > patch is to be ABI compatible with the current implementation. >> > >> > - The proposed patch involves a performance trade-off between saving the >> > costs of two atomic instructions when the counts are both 1 vs adding >> the >> > cost of loading the combined counts and comparison with two ones (e.g., >> > 0x10001). >> > >> > - The patch has been in use (built using LLVM) in a l
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
Thank you, Jonathan, for the detailed comments! I'll update the patch accordingly. On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely wrote: > On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote: > > > > Please find a proposed patch for _Sp_counted_base::_M_release to skip the > > two atomic instructions that decrement each of the use count and the weak > > count when both are 1. I proposed the general idea in an earlier thread ( > > https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and > got > > useful feedback on a draft patch and responses to related questions about > > multi-granular atomicity and alignment. This patch is based on that > > feedback. > > > > > > I added a check for thread sanitizer to use the current algorithm in that > > case because TSAN does not support multi-granular atomicity. I'd like to > > add a check of __has_feature(thread_sanitizer) for building using LLVM. I > > found examples of __has_feature in libstdc++ > > There are no uses of __has_feature in libstdc++. We do use > __has_builtin (which GCC also supports) and Clang's __is_identifier > (which GCC doesn't support) to work around some weird semantics of > __has_builtin in older versions of Clang. > > > > but it doesn't seem to be > > recognized in shared_ptr_base.h. Any guidance on how to check > > __has_feature(thread_sanitizer) in this patch? > > I think we want to do something like this in include/bits/c++config > > #if __SANITIZE_THREAD__ > # define _GLIBCXX_TSAN 1 > #elif defined __has_feature > # if __has_feature(thread_sanitizer) > # define _GLIBCXX_TSAN 1 > # endif > #endif > > Then in bits/shared_ptr_base.h > > #if _GLIBCXX_TSAN > _M_release_orig(); > return; > #endif > > > > > GCC generates code for _M_release that is larger and more complex than > that > > generated by LLVM. I'd like to file a bug report about that. Jonathan, > > Is this the same issue as https://gcc.gnu.org/PR101406 ? > > Partly yes. 
Even when using __atomic_add_dispatch I noticed that clang generated less code than gcc. I see in the response to the issue that the new glibc is expected to optimize better. So maybe this will eliminate the issue. > > would you please create a bugzilla account for me ( > > https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you. > > Done (sorry, I didn't notice the request in this mail until coming > back to it to review the patch properly). > > Thank you! > > > > > > > > Information about the patch: > > > > - Benefits of the patch: Save the cost of the last atomic decrements of > > each of the use count and the weak count in _Sp_counted_base. Atomic > > instructions are significantly slower than regular loads and stores > across > > major architectures. > > > > - How current code works: _M_release() atomically decrements the use > count, > > checks if it was 1, if so calls _M_dispose(), atomically decrements the > > weak count, checks if it was 1, and if so calls _M_destroy(). > > > > - How the proposed patch works: _M_release() loads both use count and > weak > > count together atomically (when properly aligned), checks if the value is > > equal to the value of both counts equal to 1 (e.g., 0x10001), and if > so > > calls _M_dispose() and _M_destroy(). Otherwise, it follows the original > > algorithm. > > > > - Why it works: When the current thread executing _M_release() finds each > > of the counts is equal to 1, then (when _lock_policy is _S_atomic) no > other > > threads could possibly hold use or weak references to this control block. > > That is, no other threads could possibly access the counts or the > protected > > object. > > > > - The proposed patch is intended to interact correctly with current code > > (under certain conditions: _Lock_policy is _S_atomic, proper alignment, > and > > native lock-free support for atomic operations). 
That is, multiple > threads > > using different versions of the code with and without the patch operating > > on the same objects should always interact correctly. The intent for the > > patch is to be ABI compatible with the current implementation. > > > > - The proposed patch involves a performance trade-off between saving the > > costs of two atomic instructions when the counts are both 1 vs adding the > > cost of loading the combined counts and comparison with two ones (e.g., > > 0x10001). > > > > - The patch has been in use (built using LLVM) in a large environment for > > many months. The performance gains outweigh the losses (roughly 10 to 1) > > across a large variety of workloads. > > > > > > I'd appreciate feedback on the patch and any suggestions for checking > > __has_feature(thread_sanitizer). > > N.B. gmail completely mangles patches unless you send them as attachments. > > > > diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h > > b/libstdc++-v3/include/bits/shared_ptr_base.h > > > > index 368b2d7379a..a8fc944af5f 100644 > > > > --- a/libstdc++-v3/include/bits/shared_ptr_base.h > > > > +++ b/libstdc++
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote: > > Please find a proposed patch for _Sp_counted_base::_M_release to skip the > two atomic instructions that decrement each of the use count and the weak > count when both are 1. I proposed the general idea in an earlier thread ( > https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and got > useful feedback on a draft patch and responses to related questions about > multi-granular atomicity and alignment. This patch is based on that > feedback. > > > I added a check for thread sanitizer to use the current algorithm in that > case because TSAN does not support multi-granular atomicity. I'd like to > add a check of __has_feature(thread_sanitizer) for building using LLVM. I > found examples of __has_feature in libstdc++ There are no uses of __has_feature in libstdc++. We do use __has_builtin (which GCC also supports) and Clang's __is_identifier (which GCC doesn't support) to work around some weird semantics of __has_builtin in older versions of Clang. > but it doesn't seem to be > recognized in shared_ptr_base.h. Any guidance on how to check > __has_feature(thread_sanitizer) in this patch? I think we want to do something like this in include/bits/c++config #if __SANITIZE_THREAD__ # define _GLIBCXX_TSAN 1 #elif defined __has_feature # if __has_feature(thread_sanitizer) # define _GLIBCXX_TSAN 1 # endif #endif Then in bits/shared_ptr_base.h #if _GLIBCXX_TSAN _M_release_orig(); return; #endif > GCC generates code for _M_release that is larger and more complex than that > generated by LLVM. I'd like to file a bug report about that. Jonathan, Is this the same issue as https://gcc.gnu.org/PR101406 ? > would you please create a bugzilla account for me ( > https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you. Done (sorry, I didn't notice the request in this mail until coming back to it to review the patch properly). 
> > > Information about the patch: > > - Benefits of the patch: Save the cost of the last atomic decrements of > each of the use count and the weak count in _Sp_counted_base. Atomic > instructions are significantly slower than regular loads and stores across > major architectures. > > - How current code works: _M_release() atomically decrements the use count, > checks if it was 1, if so calls _M_dispose(), atomically decrements the > weak count, checks if it was 1, and if so calls _M_destroy(). > > - How the proposed patch works: _M_release() loads both use count and weak > count together atomically (when properly aligned), checks if the value is > equal to the value of both counts equal to 1 (e.g., 0x10001), and if so > calls _M_dispose() and _M_destroy(). Otherwise, it follows the original > algorithm. > > - Why it works: When the current thread executing _M_release() finds each > of the counts is equal to 1, then (when _lock_policy is _S_atomic) no other > threads could possibly hold use or weak references to this control block. > That is, no other threads could possibly access the counts or the protected > object. > > - The proposed patch is intended to interact correctly with current code > (under certain conditions: _Lock_policy is _S_atomic, proper alignment, and > native lock-free support for atomic operations). That is, multiple threads > using different versions of the code with and without the patch operating > on the same objects should always interact correctly. The intent for the > patch is to be ABI compatible with the current implementation. > > - The proposed patch involves a performance trade-off between saving the > costs of two atomic instructions when the counts are both 1 vs adding the > cost of loading the combined counts and comparison with two ones (e.g., > 0x10001). > > - The patch has been in use (built using LLVM) in a large environment for > many months. 
The performance gains outweigh the losses (roughly 10 to 1) > across a large variety of workloads. > > > I'd appreciate feedback on the patch and any suggestions for checking > __has_feature(thread_sanitizer). N.B. gmail completely mangles patches unless you send them as attachments. > diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h > b/libstdc++-v3/include/bits/shared_ptr_base.h > > index 368b2d7379a..a8fc944af5f 100644 > > --- a/libstdc++-v3/include/bits/shared_ptr_base.h > > +++ b/libstdc++-v3/include/bits/shared_ptr_base.h > > @@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > > if (!_M_add_ref_lock_nothrow()) > > __throw_bad_weak_ptr(); > >} > > >bool > >_M_add_ref_lock_nothrow() noexcept; > > >void > >_M_release() noexcept > >{ > > +#if __SANITIZE_THREAD__ > > +_M_release_orig(); > > +return; > > +#endif > > +if (!__atomic_always_lock_free(sizeof(long long), 0) || The line break should come before the logical operator, not after. This makes it easier to see which operator it is, because it's at a
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
Would appreciate any comments on this proposed patch.

Thanks,
Maged
Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
Please let me know if more information about this patch is needed. Thank you.

Maged
[PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
Please find a proposed patch for _Sp_counted_base::_M_release to skip the two atomic instructions that decrement the use count and the weak count when both are 1. I proposed the general idea in an earlier thread (https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and got useful feedback on a draft patch and responses to related questions about multi-granular atomicity and alignment. This patch is based on that feedback.

I added a check for thread sanitizer to use the current algorithm in that case, because TSAN does not support multi-granular atomicity. I'd like to add a check of __has_feature(thread_sanitizer) for building using LLVM. I found examples of __has_feature in libstdc++ but it doesn't seem to be recognized in shared_ptr_base.h. Any guidance on how to check __has_feature(thread_sanitizer) in this patch?

GCC generates code for _M_release that is larger and more complex than that generated by LLVM. I'd like to file a bug report about that. Jonathan, would you please create a bugzilla account for me (https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.

Information about the patch:

- Benefits of the patch: saves the cost of the last atomic decrement of each of the use count and the weak count in _Sp_counted_base. Atomic instructions are significantly slower than regular loads and stores across major architectures.

- How the current code works: _M_release() atomically decrements the use count; if it was 1, calls _M_dispose(); atomically decrements the weak count; and if that was 1, calls _M_destroy().

- How the proposed patch works: _M_release() loads both the use count and the weak count together atomically (when properly aligned), checks whether the combined value corresponds to both counts being equal to 1 (e.g., 0x100000001 when _Atomic_word is 4 bytes), and if so calls _M_dispose() and _M_destroy(). Otherwise, it follows the original algorithm.

- Why it works: when the thread executing _M_release() finds that each of the counts is equal to 1, then (when _Lock_policy is _S_atomic) no other thread can possibly hold a use or weak reference to this control block. That is, no other thread can possibly access the counts or the protected object.

- The proposed patch is intended to interact correctly with the current code (under certain conditions: _Lock_policy is _S_atomic, proper alignment, and native lock-free support for the atomic operations). That is, multiple threads using different versions of the code, with and without the patch, operating on the same objects should always interact correctly. The intent is for the patch to be ABI compatible with the current implementation.

- The proposed patch involves a performance trade-off: it saves the cost of two atomic instructions when both counts are 1, at the cost of loading the combined counts and comparing them against a value with 1 in each half (e.g., 0x100000001).

- The patch has been in use (built using LLVM) in a large environment for many months. The performance gains outweigh the losses (roughly 10 to 1) across a large variety of workloads.

I'd appreciate feedback on the patch and any suggestions for checking __has_feature(thread_sanitizer).
Maged

diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index 368b2d7379a..a8fc944af5f 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
          if (!_M_add_ref_lock_nothrow())
            __throw_bad_weak_ptr();
        }

        bool
        _M_add_ref_lock_nothrow() noexcept;

        void
        _M_release() noexcept
        {
+#if __SANITIZE_THREAD__
+          _M_release_orig();
+          return;
+#endif
+          if (!__atomic_always_lock_free(sizeof(long long), 0) ||
+              !__atomic_always_lock_free(sizeof(_Atomic_word), 0) ||
+              sizeof(long long) < (2 * sizeof(_Atomic_word)) ||
+              sizeof(long long) > (sizeof(void*)))
+            {
+              _M_release_orig();
+              return;
+            }
+          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+          if (__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
+              == (1LL + (1LL << (8 * sizeof(_Atomic_word)))))
+            {
+              // Both counts are 1, so there are no weak references and
+              // we are releasing the last strong reference. No other
+              // threads can observe the effects of this _M_release()
+              // call (e.g. calling use_count()) without a data race.
+              *(long long*)(&_M_use_count) = 0;
+              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+              _M_dispose();
+              _M_destroy();
+            }
+          else
+            {
+              if ((__gnu_cxx::__exchange_and_