[PATCH] libstdc++: Improve simd::cat implementation strategy

Matthias Kretz Fri, 20 Feb 2026 02:29:21 -0800

This goes on top of the main [simd] patch and is in preparation for an
implementation of [simd.permute.dynamic].


How could/should I test this change? I have a test that compiles

auto f10(simd::vec<short, 2> a, simd::vec<short, 2> b,
         simd::vec<short, 2> c, simd::vec<short, 2> d,
         simd::vec<short, 8> e) {
  g(simd::cat(a, b, c, d, e));
}

to
        vmovdqa xmm3, xmm0
        vmovd   xmm1, esi
        vmovd   xmm0, edi
        vinsertps       xmm0, xmm0, xmm1, 16
        vmovd   xmm2, ecx
        vmovd   xmm1, edx
        vinsertps       xmm1, xmm1, xmm2, 16
        vmovq   xmm0, xmm0
        vmovq   xmm1, xmm1
        vpunpcklqdq     xmm0, xmm0, xmm1
        vmovdqa xmm1, xmm3
        vmovdqa xmm0, xmm0
        vperm2i128      ymm0, ymm0, ymm1, 32

The 2x insertps -> unpcklqdq -> 128-bit concat sequence shows that the
concatenation happens in the expected order. But I really hope this will turn into
        vmovd   xmm2, edi
        vmovd   xmm1, edx
        vpinsrd xmm2, xmm2, esi, 1
        vpinsrd xmm1, xmm1, ecx, 1
        vpunpcklqdq     xmm1, xmm2, xmm1
        vinserti128     ymm0, ymm1, xmm0, 1

at some point. I don't think we want such detailed code-gen tests here. A test
like this belongs in gcc/testsuite.

---

cat(a, b, c, d) where each argument is e.g. 2 elements wide, would fold
from the left before this change: (2, 2, 2, 2) -> (4, 2, 2) -> (6, 2) ->
(8). It is better for ILP (and to avoid load and store instructions) to
go via (4, 2, 2) -> (4, 4) -> (8).

In theory, for an even larger number of arguments, the current
implementation still isn't good enough. But passing that many arguments
is something users shouldn't be doing anyway.

Signed-off-by: Matthias Kretz <[email protected]>

libstdc++-v3/ChangeLog:

        * include/bits/vec_ops.h (__vec_concat_sized): Add an overload
        that concatenates the second and third operand, if they are
        smaller than the first.
---
 libstdc++-v3/include/bits/vec_ops.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/libstdc++-v3/include/bits/vec_ops.h b/libstdc++-v3/include/bits/vec_ops.h
index 0e89c89b7af5..e5bf2f1497cd 100644
--- a/libstdc++-v3/include/bits/vec_ops.h
+++ b/libstdc++-v3/include/bits/vec_ops.h
@@ -187,7 +187,30 @@ __vec_concat(_TV __a, _TV __b)
    * with the elements from applying this function recursively to @p __rest.
    *
    * @pre _N0 <= __width_of<_TV0> && _N1 <= __width_of<_TV1> && _Ns <= __width_of<_TVs> && ...
+   *
+   * Strategy: Aim for a power-of-2 tree concat. E.g.
+   * - cat(2, 2, 2, 2) -> cat(4, 2, 2) -> cat(4, 4)
+   * - cat(2, 2, 2, 2, 8) -> cat(4, 2, 2, 8) -> cat(4, 4, 8) -> cat(8, 8)
    */
+  template <int _N0, int _N1, int... _Ns, __vec_builtin _TV0, __vec_builtin _TV1,
+          __vec_builtin... _TVs>
+    [[__gnu__::__always_inline__]]
+    constexpr __vec_builtin_type<__vec_value_type<_TV0>,
+                                __bit_ceil(unsigned(_N0 + (_N1 + ... + _Ns)))>
+    __vec_concat_sized(const _TV0& __a, const _TV1& __b, const _TVs&... __rest);
+
+  template <int _N0, int _N1, int _N2, int... _Ns, __vec_builtin _TV0, __vec_builtin _TV1,
+           __vec_builtin _TV2, __vec_builtin... _TVs>
+    requires (__has_single_bit(unsigned(_N0))) && (_N0 >= (_N1 + _N2))
+    [[__gnu__::__always_inline__]]
+    constexpr __vec_builtin_type<__vec_value_type<_TV0>,
+                                __bit_ceil(unsigned(_N0 + _N1 + (_N2 + ... + _Ns)))>
+    __vec_concat_sized(const _TV0& __a, const _TV1& __b, const _TV2& __c, const _TVs&... __rest)
+    {
+      return __vec_concat_sized<_N0, _N1 + _N2, _Ns...>(
+              __a, __vec_concat_sized<_N1, _N2>(__b, __c), __rest...);
+    }
+
   template <int _N0, int _N1, int... _Ns, __vec_builtin _TV0, __vec_builtin _TV1,
           __vec_builtin... _TVs>
     [[__gnu__::__always_inline__]]
-- 
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
 std::simd
──────────────────────────────────────────────────────────────────────────
