[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #13 from Jan Hubicka --- Thanks a lot, looks great! Do we still auto-detect memmove when the copy constructor turns out to be memcpy equivalent after optimization?
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 Jonathan Wakely changed: What|Removed |Added Attachment #58015|0 |1 is obsolete|| --- Comment #12 from Jonathan Wakely --- Created attachment 58019 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58019=edit Make std::pair relocatable and simplify __relocate_a More comprehensive patch. With this, I see memcpy in the -fdump-tree-optimized dump.
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 Jonathan Wakely changed: What|Removed |Added Attachment #58014|0 |1 is obsolete|| --- Comment #11 from Jonathan Wakely --- Created attachment 58015 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58015=edit Make std::pair relocatable and simplify __relocate_a Fixed patch.
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #10 from Jonathan Wakely --- (In reply to Jonathan Wakely from comment #7) > Created attachment 58014 [details] > Make std::pair relocatable and simplify __relocate_a > > Does this help? Oh hang on, that patch is wrong. I'll fix it and check the results myself ...
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #9 from Jan Hubicka --- Your patch gives me error compiling testcase jh@ryzen3:/tmp> ~/trunk-install/bin/g++ -O3 ~/t.C In file included from /home/jh/trunk-install/include/c++/14.0.1/vector:65, from /home/jh/t.C:1: /home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h: In instantiation of ‘_ForwardIterator std::__relocate_a(_InputIterator, _InputIterator, _ForwardIterator, _Allocator&) [with _InputIterator = const pair*; _ForwardIterator = pair*; _Allocator = allocator >; _Traits = allocator_traits > >]’: /home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1127:31: required from ‘_Tp* std::__relocate_a(_Tp*, _Tp*, _Tp*, allocator<_T2>&) [with _Tp = pair; _Up = pair]’ 1127 | return std::__relocate_a(__cfirst, __clast, __result, __alloc); | ~^~ /home/jh/trunk-install/include/c++/14.0.1/bits/stl_vector.h:509:26: required from ‘static std::vector<_Tp, _Alloc>::pointer std::vector<_Tp, _Alloc>::_S_relocate(pointer, pointer, pointer, _Tp_alloc_type&) [with _Tp = std::pair; _Alloc = std::allocator >; pointer = std::pair*; _Tp_alloc_type = std::vector >::_Tp_alloc_type]’ 509 | return std::__relocate_a(__first, __last, __result, __alloc); |~^~~~ /home/jh/trunk-install/include/c++/14.0.1/bits/vector.tcc:647:32: required from ‘void std::vector<_Tp, _Alloc>::_M_realloc_append(_Args&& ...) [with _Args = {const std::pair&}; _Tp = std::pair; _Alloc = std::allocator >]’ 647 | __new_finish = _S_relocate(__old_start, __old_finish, |~~~^~~ 648 |__new_start, _M_get_Tp_allocator()); | ~~~ /home/jh/trunk-install/include/c++/14.0.1/bits/stl_vector.h:1294:21: required from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = std::pair; _Alloc = std::allocator >; value_type = std::pair]’ 1294 | _M_realloc_append(__x); | ~^ /home/jh/t.C:8:25: required from here 8 | stack.push_back (pair); | ^~ /home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1084:56: error: use of deleted function ‘const _Tp* std::addressof(const _Tp&&) [with _Tp = pair]’ 1084 | std::addressof(std::move(*__first | ~~^ In file included from /home/jh/trunk-install/include/c++/14.0.1/bits/stl_pair.h:61, from /home/jh/trunk-install/include/c++/14.0.1/bits/stl_algobase.h:64, from /home/jh/trunk-install/include/c++/14.0.1/vector:62: /home/jh/trunk-install/include/c++/14.0.1/bits/move.h:168:16: note: declared here 168 | const _Tp* addressof(const _Tp&&) = delete; |^ /home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1084:56: note: use ‘-fdiagnostics-all-candidates’ to display considered candidates 1084 | std::addressof(std::move(*__first | ~~^ It is easy to check if conversion happens - just compile it and see if there is memcpy or memmove in the optimized dump file (or final assembly)
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #8 from Jan Hubicka --- I had wrong noexcept specifier. This version works, but I still need to inline relocate_object_a into the loop diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h b/libstdc++-v3/include/bits/stl_uninitialized.h index 7f84da31578..f02d4fb878f 100644 --- a/libstdc++-v3/include/bits/stl_uninitialized.h +++ b/libstdc++-v3/include/bits/stl_uninitialized.h @@ -1100,8 +1100,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION "relocation is only possible for values of the same type"); _ForwardIterator __cur = __result; for (; __first != __last; ++__first, (void)++__cur) - std::__relocate_object_a(std::__addressof(*__cur), -std::__addressof(*__first), __alloc); + { + typedef std::allocator_traits<_Allocator> __traits; + __traits::construct(__alloc, std::__addressof(*__cur), std::move(*std::__addressof(*__first))); + __traits::destroy(__alloc, std::__addressof(*std::__addressof(*__first))); + } return __cur; } @@ -1109,8 +1112,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION template _GLIBCXX20_CONSTEXPR inline __enable_if_t::value, _Tp*> -__relocate_a_1(_Tp* __first, _Tp* __last, - _Tp* __result, +__relocate_a_1(_Tp* __restrict __first, _Tp* __last, + _Tp* __restrict __result, [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept { ptrdiff_t __count = __last - __first; @@ -1147,6 +1150,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION std::__niter_base(__result), __alloc); } + template +_GLIBCXX20_CONSTEXPR +inline _Tp* +__relocate_a(_Tp* __restrict __first, _Tp* __last, +_Tp* __restrict __result, +allocator<_Up>& __alloc) +noexcept(noexcept(__relocate_a_1(__first, __last, __result, __alloc))) +{ + return std::__relocate_a_1(__first, __last, __result, __alloc); +} + /// @endcond #endif // C++11
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #7 from Jonathan Wakely --- Created attachment 58014 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58014=edit Make std::pair relocatable and simplify __relocate_a Does this help?
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #6 from Jan Hubicka --- Thanks. I though the relocate_a only cares about the fact if the pointed-to type can be bitwise copied. It would be nice to early produce memcpy from libstdc++ for std::pair, so the second patch makes sense to me (I did not test if it works) I think it would be still nice to tell GCC that the copy loop never gets overlapping memory locations so the cases which are not early optimized to memcpy can still be optimized later (or vectorized if it does really something non-trivial). So i tried your second patch fixed so it compiles: diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h b/libstdc++-v3/include/bits/stl_uninitialized.h index 7f84da31578..0d2e588ae5e 100644 --- a/libstdc++-v3/include/bits/stl_uninitialized.h +++ b/libstdc++-v3/include/bits/stl_uninitialized.h @@ -1109,8 +1109,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION template _GLIBCXX20_CONSTEXPR inline __enable_if_t::value, _Tp*> -__relocate_a_1(_Tp* __first, _Tp* __last, - _Tp* __result, +__relocate_a_1(_Tp* __restrict __first, _Tp* __last, + _Tp* __restrict __result, [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept { ptrdiff_t __count = __last - __first; @@ -1147,6 +1147,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION std::__niter_base(__result), __alloc); } + template +_GLIBCXX20_CONSTEXPR +inline _Tp* +__relocate_a(_Tp* __restrict __first, _Tp* __last, +_Tp* __restrict __result, +allocator<_Up>& __alloc) +noexcept(std::__is_bitwise_relocatable<_Tp>::value) +{ + return std::__relocate_a_1(__first, __last, __result, __alloc); +} + /// @endcond #endif // C++11 it does not make ldist to hit, so the restrict info is still lost. I think the problem is that if you call relocate_object the restrict reduces scope, so we only know that the elements are pairwise disjoint, not that the vectors are. This is because restrict is interpreted early pre-inlining, but it is really Richard's area. It seems that the patch makes us to go through __uninitialized_copy_a instead of __uninit_copy. I am not even sure how these are different, so I need to stare at the code bit more to make sense of it :)
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #5 from Jonathan Wakely --- If the problem is simply that the __restrict qualifiers are not present because we go through the generic function taking iterators, then we can just add: --- a/libstdc++-v3/include/bits/stl_uninitialized.h +++ b/libstdc++-v3/include/bits/stl_uninitialized.h @@ -1109,8 +1109,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION template _GLIBCXX20_CONSTEXPR inline __enable_if_t::value, _Tp*> -__relocate_a_1(_Tp* __first, _Tp* __last, - _Tp* __result, +__relocate_a_1(_Tp* __restrict __first, _Tp* __last, + _Tp* __restrict __result, [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept { ptrdiff_t __count = __last - __first; @@ -1147,6 +1147,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION std::__niter_base(__result), __alloc); } + template +_GLIBCXX20_CONSTEXPR +inline _ForwardIterator +__relocate_a(_Tp* __restrict __first, _Tp* __last, +_Tp* __restrict __result, +[[__maybe_unused__]] allocator<_Up>& __alloc) noexcept +noexcept(std::__is_bitwise_relocatable<_Tp>::value) +{ + return std::__relocate_a_1(__first, __last, __result, __alloc); +} + /// @endcond #endif // C++11 If the problem is that std::pair is not bitwise_relocatable, then we could change that (as Marc suggested as a possible future enhancement): --- a/libstdc++-v3/include/bits/stl_uninitialized.h +++ b/libstdc++-v3/include/bits/stl_uninitialized.h @@ -1082,6 +1082,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION struct __is_bitwise_relocatable : is_trivial<_Tp> { }; + template +struct __is_bitwise_relocatable, void> +: __and_, is_trivial<_Up>> +{ }; + template _GLIBCXX20_CONSTEXPR
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #4 from Jonathan Wakely --- (In reply to Jan Hubicka from comment #2) > --- a/libstdc++-v3/include/bits/stl_uninitialized.h > +++ b/libstdc++-v3/include/bits/stl_uninitialized.h > @@ -1130,7 +1130,58 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > } >return __result + __count; > } > + > + template > +_GLIBCXX20_CONSTEXPR > +inline __enable_if_t::value, _Tp*> > +__relocate_a(_Tp * __restrict __first, _Tp *__last, > +_Tp * __restrict __result, _Allocator& __alloc) noexcept This is wrong, we can't optimize arbitrary allocators, only when the allocator is std::allocator<_Tp>. That's what the existing code is doing with the indirection from __relocate_a to __relocate_a_1. > +{ > + ptrdiff_t __count = __last - __first; > + if (__count > 0) > + { > +#ifdef __cpp_lib_is_constant_evaluated > + if (std::is_constant_evaluated()) > + { > + for (; __first != __last; ++__first, (void)++__result) You don't need the (void) case here because __first and __result are both pointers. That's only needed for the generic __relocate_a that deals with arbitrary iterator types. > + { > + // manually inline relocate_object_a to not lose restrict > qualifiers We don't care about restrict when is_constant_evaluated is true, we're not optimizing this code. It just gets interpreted at compile time. There is no reason to inline __relocate_object_a for the is_constant_evaluated case. > + typedef std::allocator_traits<_Allocator> __traits; > + __traits::construct(__alloc, __result, > std::move(*__first)); > + __traits::destroy(__alloc, std::__addressof(*__first)); > + } > + return __result; > + } > #endif > + __builtin_memcpy(__result, __first, __count * sizeof(_Tp)); > + } > + return __result + __count; > +} > +#endif > + > + template > +_GLIBCXX20_CONSTEXPR > +#if _GLIBCXX_HOSTED > +inline __enable_if_t::value, _Tp*> > +#else > +inline _Tp * > +#endif > +__relocate_a(_Tp * __restrict __first, _Tp *__last, > +_Tp * __restrict __result, _Allocator& __alloc) > +noexcept(noexcept(std::allocator_traits<_Allocator>::construct(__alloc, > +__result, std::move(*__first))) > +&& noexcept(std::allocator_traits<_Allocator>::destroy( > + __alloc, std::__addressof(*__first > +{ > + for (; __first != __last; ++__first, (void)++__result) > + { > + // manually inline relocate_object_a to not lose restrict > qualifiers > + typedef std::allocator_traits<_Allocator> __traits; > + __traits::construct(__alloc, __result, std::move(*__first)); > + __traits::destroy(__alloc, std::__addressof(*__first)); > + } I don't understand this overload at all. Is this overload the one that actually gets used for your testcase? I think it should be, because std::pair is not bitwise_relocatable. Why does the restrict qualifier get lost if we don't inline relocate_object_a? That function is restrict-qualified as well. > + return __result; > +} > >template typename _Allocator>
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #3 from Jonathan Wakely --- (In reply to Jan Hubicka from comment #2) > This seems to do the trick, but for some reason I get memmove What's the second overload for, and why does it depend on _GLIBCXX_HOSTED?
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #2 from Jan Hubicka --- What I am shooting for is to optimize it later in loop distribution. We can recognize memcpy loop if we can figure out that source and destination memory are different. We can help here with restrict, but I was bit lost in how to get them done. This seems to do the trick, but for some reason I get memmove diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h b/libstdc++-v3/include/bits/stl_uninitialized.h index 7f84da31578..1a6223ea892 100644 --- a/libstdc++-v3/include/bits/stl_uninitialized.h +++ b/libstdc++-v3/include/bits/stl_uninitialized.h @@ -1130,7 +1130,58 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } return __result + __count; } + + template +_GLIBCXX20_CONSTEXPR +inline __enable_if_t::value, _Tp*> +__relocate_a(_Tp * __restrict __first, _Tp *__last, +_Tp * __restrict __result, _Allocator& __alloc) noexcept +{ + ptrdiff_t __count = __last - __first; + if (__count > 0) + { +#ifdef __cpp_lib_is_constant_evaluated + if (std::is_constant_evaluated()) + { + for (; __first != __last; ++__first, (void)++__result) + { + // manually inline relocate_object_a to not lose restrict qualifiers + typedef std::allocator_traits<_Allocator> __traits; + __traits::construct(__alloc, __result, std::move(*__first)); + __traits::destroy(__alloc, std::__addressof(*__first)); + } + return __result; + } #endif + __builtin_memcpy(__result, __first, __count * sizeof(_Tp)); + } + return __result + __count; +} +#endif + + template +_GLIBCXX20_CONSTEXPR +#if _GLIBCXX_HOSTED +inline __enable_if_t::value, _Tp*> +#else +inline _Tp * +#endif +__relocate_a(_Tp * __restrict __first, _Tp *__last, +_Tp * __restrict __result, _Allocator& __alloc) +noexcept(noexcept(std::allocator_traits<_Allocator>::construct(__alloc, +__result, std::move(*__first))) +&& noexcept(std::allocator_traits<_Allocator>::destroy( + __alloc, std::__addressof(*__first +{ + for (; __first != __last; ++__first, (void)++__result) + { + // manually inline relocate_object_a to not lose restrict qualifiers + typedef std::allocator_traits<_Allocator> __traits; + __traits::construct(__alloc, __result, std::move(*__first)); + __traits::destroy(__alloc, std::__addressof(*__first)); + } + return __result; +} template
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #1 from Jonathan Wakely --- Using memcpy on any std::pair is undefined behaviour because it's not trivially copyable. That's not because it has a copy constructor, its copy constructor is defaulted and so trivial if the data members are trivially copy constructible: constexpr pair(const pair&) = default;///< Copy constructor It's because it has a non-trivial assignment operator: /// Copy assignment operator constexpr pair& operator=(const pair& __p) noexcept(_S_nothrow_assignable()) requires (_S_assignable()) { first = __p.first; second = __p.second; return *this; } I think this exact point was discussed when Marc introduced the relocate optimizations. We could maybe cheat and say that we know it's safe to memcpy std::pair even though the language says it's undefined, because we know what our std::pair implementation looks like.