This goes on top of the main [simd] patch and is in preparation for an implementation of [simd.permute.dynamic].
How could/should I test this change? I have a test that compiles
auto f10(simd::vec<short, 2> a, simd::vec<short, 2> b,
simd::vec<short, 2> c, simd::vec<short, 2> d,
simd::vec<short, 8> e) {
g(simd::cat(a, b, c, d, e));
}
to
vmovdqa xmm3, xmm0
vmovd xmm1, esi
vmovd xmm0, edi
vinsertps xmm0, xmm0, xmm1, 16
vmovd xmm2, ecx
vmovd xmm1, edx
vinsertps xmm1, xmm1, xmm2, 16
vmovq xmm0, xmm0
vmovq xmm1, xmm1
vpunpcklqdq xmm0, xmm0, xmm1
vmovdqa xmm1, xmm3
vmovdqa xmm0, xmm0
vperm2i128 ymm0, ymm0, ymm1, 32
The 2x insertps -> unpck -> 128-bit concat sequence shows that the code
follows the expected tree order. But I really hope this will turn into
vmovd xmm2, edi
vmovd xmm1, edx
vpinsrd xmm2, xmm2, esi, 1
vpinsrd xmm1, xmm1, ecx, 1
vpunpcklqdq xmm1, xmm2, xmm1
vinserti128 ymm0, ymm1, xmm0, 1
at some point. I don't think we want such detailed code-gen tests here. A test
like this belongs in gcc/testsuite.
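
For readers who want to check the concat order without reading assembly, here
is a minimal sketch that models only the operand widths. It is not libstdc++
code: Widths, is_pow2, step_left and step_tree are hypothetical names, and the
fallback "merge the first two operands" is an assumption based on the left fold
described in the commit message below. The step_tree condition mirrors the
requires-clause of the new overload in the patch.

```cpp
// Hypothetical model: tracks operand widths only, no actual vector data.
struct Widths
{
  int w[8] = {};  // operand widths, left to right
  int count = 0;  // number of operands still to be concatenated
};

constexpr bool
operator==(const Widths& a, const Widths& b)
{
  if (a.count != b.count)
    return false;
  for (int i = 0; i < a.count; ++i)
    if (a.w[i] != b.w[i])
      return false;
  return true;
}

constexpr bool
is_pow2(int n)
{ return n > 0 && (n & (n - 1)) == 0; }

// Old behavior (assumed): always merge the first two operands.
// Requires in.count >= 2.
constexpr Widths
step_left(const Widths& in)
{
  Widths out;
  out.w[out.count++] = in.w[0] + in.w[1];
  for (int i = 2; i < in.count; ++i)
    out.w[out.count++] = in.w[i];
  return out;
}

// New behavior: if the first operand is a power of 2 and at least as wide
// as the second and third together, merge the second and third instead.
constexpr Widths
step_tree(const Widths& in)
{
  if (in.count >= 3 && is_pow2(in.w[0]) && in.w[0] >= in.w[1] + in.w[2])
    {
      Widths out;
      out.w[out.count++] = in.w[0];
      out.w[out.count++] = in.w[1] + in.w[2];
      for (int i = 3; i < in.count; ++i)
        out.w[out.count++] = in.w[i];
      return out;
    }
  return step_left(in);
}

// cat(2, 2, 2, 2): left fold passes through (6, 2), tree through (4, 4).
static_assert(step_left(Widths{{4, 2, 2}, 3}) == Widths{{6, 2}, 2}, "left");
static_assert(step_tree(Widths{{4, 2, 2}, 3}) == Widths{{4, 4}, 2}, "tree");
```

The model says nothing about the instructions GCC emits. It only shows that the
intermediate widths stay powers of 2 for the two examples in the doc comment.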
---
cat(a, b, c, d), where each argument is e.g. 2 elements wide, would fold
from the left before this change: (2, 2, 2, 2) -> (4, 2, 2) -> (6, 2) ->
(8). It is better for ILP (and to avoid load and store instructions) to
go via (4, 2, 2) -> (4, 4) -> (8).
In theory, for an even larger number of arguments, the current
implementation still isn't good enough. But passing that many arguments
is something users shouldn't be doing anyway.
Signed-off-by: Matthias Kretz <[email protected]>
libstdc++-v3/ChangeLog:
* include/bits/vec_ops.h (__vec_concat_sized): Add an overload
that concatenates the second and third operand, if they are
smaller than the first.
---
libstdc++-v3/include/bits/vec_ops.h | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/libstdc++-v3/include/bits/vec_ops.h b/libstdc++-v3/include/bits/vec_ops.h
index 0e89c89b7af5..e5bf2f1497cd 100644
--- a/libstdc++-v3/include/bits/vec_ops.h
+++ b/libstdc++-v3/include/bits/vec_ops.h
@@ -187,7 +187,30 @@ __vec_concat(_TV __a, _TV __b)
* with the elements from applying this function recursively to @p __rest.
*
* @pre _N0 <= __width_of<_TV0> && _N1 <= __width_of<_TV1> && _Ns <= __width_of<_TVs> && ...
+ *
+ * Strategy: Aim for a power-of-2 tree concat. E.g.
+ * - cat(2, 2, 2, 2) -> cat(4, 2, 2) -> cat(4, 4)
+ * - cat(2, 2, 2, 2, 8) -> cat(4, 2, 2, 8) -> cat(4, 4, 8) -> cat(8, 8)
*/
+ template <int _N0, int _N1, int... _Ns, __vec_builtin _TV0, __vec_builtin _TV1,
+ __vec_builtin... _TVs>
+ [[__gnu__::__always_inline__]]
+ constexpr __vec_builtin_type<__vec_value_type<_TV0>,
+ __bit_ceil(unsigned(_N0 + (_N1 + ... + _Ns)))>
+ __vec_concat_sized(const _TV0& __a, const _TV1& __b, const _TVs&... __rest);
+
+ template <int _N0, int _N1, int _N2, int... _Ns, __vec_builtin _TV0, __vec_builtin _TV1,
+ __vec_builtin _TV2, __vec_builtin... _TVs>
+ requires (__has_single_bit(unsigned(_N0))) && (_N0 >= (_N1 + _N2))
+ [[__gnu__::__always_inline__]]
+ constexpr __vec_builtin_type<__vec_value_type<_TV0>,
+ __bit_ceil(unsigned(_N0 + _N1 + (_N2 + ... + _Ns)))>
+ __vec_concat_sized(const _TV0& __a, const _TV1& __b, const _TV2& __c, const _TVs&... __rest)
+ {
+ return __vec_concat_sized<_N0, _N1 + _N2, _Ns...>(
+ __a, __vec_concat_sized<_N1, _N2>(__b, __c), __rest...);
+ }
+
template <int _N0, int _N1, int... _Ns, __vec_builtin _TV0, __vec_builtin _TV1,
__vec_builtin... _TVs>
[[__gnu__::__always_inline__]]
--
──────────────────────────────────────────────────────────────────────────
Dr. Matthias Kretz https://mattkretz.github.io
GSI Helmholtz Center for Heavy Ion Research https://gsi.de
std::simd
──────────────────────────────────────────────────────────────────────────
