[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #13 from Jan Hubicka  ---
Thanks a lot, looks great!
Do we still auto-detect memmove when the copy constructor turns out to be
memcpy equivalent after optimization?

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

Jonathan Wakely  changed:

   What|Removed |Added

  Attachment #58015|0   |1
is obsolete||

--- Comment #12 from Jonathan Wakely  ---
Created attachment 58019
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58019=edit
Make std::pair relocatable and simplify __relocate_a

More comprehensive patch.

With this, I see memcpy in the -fdump-tree-optimized dump.

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

Jonathan Wakely  changed:

   What|Removed |Added

  Attachment #58014|0   |1
is obsolete||

--- Comment #11 from Jonathan Wakely  ---
Created attachment 58015
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58015=edit
Make std::pair relocatable and simplify __relocate_a

Fixed patch.

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #10 from Jonathan Wakely  ---
(In reply to Jonathan Wakely from comment #7)
> Created attachment 58014 [details]
> Make std::pair relocatable and simplify __relocate_a
> 
> Does this help?

Oh hang on, that patch is wrong. I'll fix it and check the results myself ...

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #9 from Jan Hubicka  ---
Your patch gives me error compiling testcase

jh@ryzen3:/tmp> ~/trunk-install/bin/g++ -O3 ~/t.C 
In file included from /home/jh/trunk-install/include/c++/14.0.1/vector:65,
 from /home/jh/t.C:1:
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h: In
instantiation of ‘_ForwardIterator std::__relocate_a(_InputIterator,
_InputIterator, _ForwardIterator, _Allocator&) [with _InputIterator = const
pair*; _ForwardIterator = pair*; _Allocator = allocator >;
_Traits = allocator_traits > >]’:
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1127:31:  
required from ‘_Tp* std::__relocate_a(_Tp*, _Tp*, _Tp*, allocator<_T2>&) [with
_Tp = pair; _Up = pair]’
 1127 |   return std::__relocate_a(__cfirst, __clast, __result, __alloc);
  |  ~^~
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_vector.h:509:26:   required
from ‘static std::vector<_Tp, _Alloc>::pointer std::vector<_Tp,
_Alloc>::_S_relocate(pointer, pointer, pointer, _Tp_alloc_type&) [with _Tp =
std::pair; _Alloc =
std::allocator >; pointer =
std::pair*; _Tp_alloc_type =
std::vector >::_Tp_alloc_type]’
  509 | return std::__relocate_a(__first, __last, __result, __alloc);
  |~^~~~
/home/jh/trunk-install/include/c++/14.0.1/bits/vector.tcc:647:32:   required
from ‘void std::vector<_Tp, _Alloc>::_M_realloc_append(_Args&& ...) [with _Args
= {const std::pair&}; _Tp = std::pair; _Alloc = std::allocator
>]’
  647 | __new_finish = _S_relocate(__old_start, __old_finish,
  |~~~^~~
  648 |__new_start,
_M_get_Tp_allocator());
  |   
~~~
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_vector.h:1294:21:   required
from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp =
std::pair; _Alloc =
std::allocator >; value_type =
std::pair]’
 1294 |   _M_realloc_append(__x);
  |   ~^
/home/jh/t.C:8:25:   required from here
8 | stack.push_back (pair);
  | ^~
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1084:56:
error: use of deleted function ‘const _Tp* std::addressof(const _Tp&&) [with
_Tp = pair]’
 1084 | 
std::addressof(std::move(*__first
  | 
~~^
In file included from
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_pair.h:61,
 from
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_algobase.h:64,
 from /home/jh/trunk-install/include/c++/14.0.1/vector:62:
/home/jh/trunk-install/include/c++/14.0.1/bits/move.h:168:16: note: declared
here
  168 | const _Tp* addressof(const _Tp&&) = delete;
  |^
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1084:56:
note: use ‘-fdiagnostics-all-candidates’ to display considered candidates
 1084 | 
std::addressof(std::move(*__first
  | 
~~^


It is easy to check if conversion happens - just compile it and see if there is
memcpy or memmove in the optimized dump file (or final assembly)

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #8 from Jan Hubicka  ---
I had wrong noexcept specifier.  This version works, but I still need to inline
relocate_object_a into the loop

diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h
b/libstdc++-v3/include/bits/stl_uninitialized.h
index 7f84da31578..f02d4fb878f 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1100,8 +1100,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  "relocation is only possible for values of the same type");
   _ForwardIterator __cur = __result;
   for (; __first != __last; ++__first, (void)++__cur)
-   std::__relocate_object_a(std::__addressof(*__cur),
-std::__addressof(*__first), __alloc);
+   {
+ typedef std::allocator_traits<_Allocator> __traits;
+ __traits::construct(__alloc, std::__addressof(*__cur),
std::move(*std::__addressof(*__first)));
+ __traits::destroy(__alloc,
std::__addressof(*std::__addressof(*__first)));
+   }
   return __cur;
 }

@@ -1109,8 +1112,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template 
 _GLIBCXX20_CONSTEXPR
 inline __enable_if_t::value, _Tp*>
-__relocate_a_1(_Tp* __first, _Tp* __last,
-  _Tp* __result,
+__relocate_a_1(_Tp* __restrict __first, _Tp* __last,
+  _Tp* __restrict __result,
   [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept
 {
   ptrdiff_t __count = __last - __first;
@@ -1147,6 +1150,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 std::__niter_base(__result), __alloc);
 }

+  template 
+_GLIBCXX20_CONSTEXPR
+inline _Tp*
+__relocate_a(_Tp* __restrict __first, _Tp* __last,
+_Tp* __restrict __result,
+allocator<_Up>& __alloc)
+noexcept(noexcept(__relocate_a_1(__first, __last, __result, __alloc)))
+{
+  return std::__relocate_a_1(__first, __last, __result, __alloc);
+}
+
   /// @endcond
 #endif // C++11

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #7 from Jonathan Wakely  ---
Created attachment 58014
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58014=edit
Make std::pair relocatable and simplify __relocate_a

Does this help?

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #6 from Jan Hubicka  ---
Thanks. I though the relocate_a only cares about the fact if the pointed-to
type can be bitwise copied.  It would be nice to early produce memcpy from
libstdc++ for std::pair, so the second patch makes sense to me (I did not test
if it works)

I think it would be still nice to tell GCC that the copy loop never gets
overlapping memory locations so the cases which are not early optimized to
memcpy can still be optimized later (or vectorized if it does really something
non-trivial).

So i tried your second patch fixed so it compiles:
diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h
b/libstdc++-v3/include/bits/stl_uninitialized.h
index 7f84da31578..0d2e588ae5e 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1109,8 +1109,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template 
 _GLIBCXX20_CONSTEXPR
 inline __enable_if_t::value, _Tp*>
-__relocate_a_1(_Tp* __first, _Tp* __last,
-  _Tp* __result,
+__relocate_a_1(_Tp* __restrict __first, _Tp* __last,
+  _Tp* __restrict __result,
   [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept
 {
   ptrdiff_t __count = __last - __first;
@@ -1147,6 +1147,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 std::__niter_base(__result), __alloc);
 }

+  template 
+_GLIBCXX20_CONSTEXPR
+inline _Tp*
+__relocate_a(_Tp* __restrict __first, _Tp* __last,
+_Tp* __restrict __result,
+allocator<_Up>& __alloc)
+noexcept(std::__is_bitwise_relocatable<_Tp>::value)
+{
+  return std::__relocate_a_1(__first, __last, __result, __alloc);
+}
+
   /// @endcond
 #endif // C++11

it does not make ldist to hit, so the restrict info is still lost.  I think the
problem is that if you call relocate_object the restrict reduces scope, so we
only know that the elements are pairwise disjoint, not that the vectors are.
This is because restrict is interpreted early pre-inlining, but it is really
Richard's area.

It seems that the patch makes us to go through __uninitialized_copy_a instead
of __uninit_copy. I am not even sure how these are different, so I need to
stare at the code bit more to make sense of it :)

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #5 from Jonathan Wakely  ---
If the problem is simply that the __restrict qualifiers are not present because
we go through the generic function taking iterators, then we can just add:

--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1109,8 +1109,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template 
 _GLIBCXX20_CONSTEXPR
 inline __enable_if_t::value, _Tp*>
-__relocate_a_1(_Tp* __first, _Tp* __last,
-  _Tp* __result,
+__relocate_a_1(_Tp* __restrict __first, _Tp* __last,
+  _Tp* __restrict __result,
   [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept
 {
   ptrdiff_t __count = __last - __first;
@@ -1147,6 +1147,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 std::__niter_base(__result), __alloc);
 }

+  template 
+_GLIBCXX20_CONSTEXPR
+inline _ForwardIterator
+__relocate_a(_Tp* __restrict __first, _Tp* __last,
+_Tp* __restrict __result,
+[[__maybe_unused__]] allocator<_Up>& __alloc) noexcept
+noexcept(std::__is_bitwise_relocatable<_Tp>::value)
+{
+  return std::__relocate_a_1(__first, __last, __result, __alloc);
+}
+
   /// @endcond
 #endif // C++11


If the problem is that std::pair is not bitwise_relocatable, then we
could change that (as Marc suggested as a possible future enhancement):

--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1082,6 +1082,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 struct __is_bitwise_relocatable
 : is_trivial<_Tp> { };

+  template
+struct __is_bitwise_relocatable, void>
+: __and_, is_trivial<_Up>>
+{ };
+
   template 
 _GLIBCXX20_CONSTEXPR

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #4 from Jonathan Wakely  ---
(In reply to Jan Hubicka from comment #2)
> --- a/libstdc++-v3/include/bits/stl_uninitialized.h
> +++ b/libstdc++-v3/include/bits/stl_uninitialized.h
> @@ -1130,7 +1130,58 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> }
>return __result + __count;
>  }
> +
> +  template 
> +_GLIBCXX20_CONSTEXPR
> +inline __enable_if_t::value, _Tp*>
> +__relocate_a(_Tp * __restrict __first, _Tp *__last,
> +_Tp * __restrict __result, _Allocator& __alloc) noexcept

This is wrong, we can't optimize arbitrary allocators, only when the allocator
is std::allocator<_Tp>. That's what the existing code is doing with the
indirection from __relocate_a to __relocate_a_1.

> +{
> +  ptrdiff_t __count = __last - __first;
> +  if (__count > 0)
> +   {
> +#ifdef __cpp_lib_is_constant_evaluated
> + if (std::is_constant_evaluated())
> +   {
> + for (; __first != __last; ++__first, (void)++__result)

You don't need the (void) case here because __first and __result are both
pointers. That's only needed for the generic __relocate_a that deals with
arbitrary iterator types.

> +   {
> + // manually inline relocate_object_a to not lose restrict
> qualifiers

We don't care about restrict when is_constant_evaluated is true, we're not
optimizing this code. It just gets interpreted at compile time. There is no
reason to inline __relocate_object_a for the is_constant_evaluated case.


> + typedef std::allocator_traits<_Allocator> __traits;
> + __traits::construct(__alloc, __result,
> std::move(*__first));
> + __traits::destroy(__alloc, std::__addressof(*__first));
> +   }
> + return __result;
> +   }
>  #endif
> + __builtin_memcpy(__result, __first, __count * sizeof(_Tp));
> +   }
> +  return __result + __count;
> +}
> +#endif
> +
> +  template 
> +_GLIBCXX20_CONSTEXPR
> +#if _GLIBCXX_HOSTED
> +inline __enable_if_t::value, _Tp*>
> +#else
> +inline _Tp *
> +#endif
> +__relocate_a(_Tp * __restrict __first, _Tp *__last,
> +_Tp * __restrict __result, _Allocator& __alloc)
> +noexcept(noexcept(std::allocator_traits<_Allocator>::construct(__alloc,
> +__result, std::move(*__first)))
> +&& noexcept(std::allocator_traits<_Allocator>::destroy(
> +   __alloc, std::__addressof(*__first
> +{
> +  for (; __first != __last; ++__first, (void)++__result)
> +   {
> + // manually inline relocate_object_a to not lose restrict
> qualifiers
> + typedef std::allocator_traits<_Allocator> __traits;
> + __traits::construct(__alloc, __result, std::move(*__first));
> + __traits::destroy(__alloc, std::__addressof(*__first));
> +   }

I don't understand this overload at all. Is this overload the one that actually
gets used for your testcase? I think it should be, because std::pair
is not bitwise_relocatable.

Why does the restrict qualifier get lost if we don't inline relocate_object_a?
That function is restrict-qualified as well.


> +  return __result;
> +}
>  
>template  typename _Allocator>

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #3 from Jonathan Wakely  ---
(In reply to Jan Hubicka from comment #2)
> This seems to do the trick, but for some reason I get memmove

What's the second overload for, and why does it depend on _GLIBCXX_HOSTED?

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #2 from Jan Hubicka  ---
What I am shooting for is to optimize it later in loop distribution. We can
recognize memcpy loop if we can figure out that source and destination memory
are different.

We can help here with restrict, but I was bit lost in how to get them done.

This seems to do the trick, but for some reason I get memmove

diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h
b/libstdc++-v3/include/bits/stl_uninitialized.h
index 7f84da31578..1a6223ea892 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1130,7 +1130,58 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
}
   return __result + __count;
 }
+
+  template 
+_GLIBCXX20_CONSTEXPR
+inline __enable_if_t::value, _Tp*>
+__relocate_a(_Tp * __restrict __first, _Tp *__last,
+_Tp * __restrict __result, _Allocator& __alloc) noexcept
+{
+  ptrdiff_t __count = __last - __first;
+  if (__count > 0)
+   {
+#ifdef __cpp_lib_is_constant_evaluated
+ if (std::is_constant_evaluated())
+   {
+ for (; __first != __last; ++__first, (void)++__result)
+   {
+ // manually inline relocate_object_a to not lose restrict
qualifiers
+ typedef std::allocator_traits<_Allocator> __traits;
+ __traits::construct(__alloc, __result, std::move(*__first));
+ __traits::destroy(__alloc, std::__addressof(*__first));
+   }
+ return __result;
+   }
 #endif
+ __builtin_memcpy(__result, __first, __count * sizeof(_Tp));
+   }
+  return __result + __count;
+}
+#endif
+
+  template 
+_GLIBCXX20_CONSTEXPR
+#if _GLIBCXX_HOSTED
+inline __enable_if_t::value, _Tp*>
+#else
+inline _Tp *
+#endif
+__relocate_a(_Tp * __restrict __first, _Tp *__last,
+_Tp * __restrict __result, _Allocator& __alloc)
+noexcept(noexcept(std::allocator_traits<_Allocator>::construct(__alloc,
+__result, std::move(*__first)))
+&& noexcept(std::allocator_traits<_Allocator>::destroy(
+   __alloc, std::__addressof(*__first
+{
+  for (; __first != __last; ++__first, (void)++__result)
+   {
+ // manually inline relocate_object_a to not lose restrict qualifiers
+ typedef std::allocator_traits<_Allocator> __traits;
+ __traits::construct(__alloc, __result, std::move(*__first));
+ __traits::destroy(__alloc, std::__addressof(*__first));
+   }
+  return __result;
+}

   template 

[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible

2024-04-23 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821

--- Comment #1 from Jonathan Wakely  ---
Using memcpy on any std::pair is undefined behaviour because it's not trivially
copyable. 

That's not because it has a copy constructor, its copy constructor is defaulted
and so trivial if the data members are trivially copy constructible:
  constexpr pair(const pair&) = default;///< Copy constructor

It's because it has a non-trivial assignment operator:

  /// Copy assignment operator
  constexpr pair&
  operator=(const pair& __p)
  noexcept(_S_nothrow_assignable())
  requires (_S_assignable())
  {
first = __p.first;
second = __p.second;
return *this;
  }


I think this exact point was discussed when Marc introduced the relocate
optimizations. We could maybe cheat and say that we know it's safe to memcpy
std::pair even though the language says it's undefined, because we
know what our std::pair implementation looks like.