Re: [PATCH] libstdc++: Improve simd::cat implementation strategy

Tomasz Kaminski Sun, 22 Feb 2026 23:14:31 -0800

On Fri, Feb 20, 2026 at 11:30 AM Matthias Kretz <[email protected]>
wrote:


> This goes on top of the main [simd] patch and is in preparation of an
> implementation for [simd.permute.dynamic].
>
> How could/should I test this change? I have a test that compiles
>
I think we are good with tests that only check if the result is correct,
i.e. elements have correct values.

>
> auto f10(simd::vec<short, 2> a, simd::vec<short, 2> b,
>          simd::vec<short, 2> c, simd::vec<short, 2> d,
>          simd::vec<short, 8> e) {
>   g(simd::cat(a, b, c, d, e));
> }
>
> to
>         vmovdqa xmm3, xmm0
>         vmovd   xmm1, esi
>         vmovd   xmm0, edi
>         vinsertps       xmm0, xmm0, xmm1, 16
>         vmovd   xmm2, ecx
>         vmovd   xmm1, edx
>         vinsertps       xmm1, xmm1, xmm2, 16
>         vmovq   xmm0, xmm0
>         vmovq   xmm1, xmm1
>         vpunpcklqdq     xmm0, xmm0, xmm1
>         vmovdqa xmm1, xmm3
>         vmovdqa xmm0, xmm0
>         vperm2i128      ymm0, ymm0, ymm1, 32
>
> The 2x insertps -> unpck -> 128-bit concat sequence shows it's doing the
> expected sequence. But I really hope this will turn into
>         vmovd   xmm2, edi
>         vmovd   xmm1, edx
>         vpinsrd xmm2, xmm2, esi, 1
>         vpinsrd xmm1, xmm1, ecx, 1
>         vpunpcklqdq     xmm1, xmm2, xmm1
>         vinserti128     ymm0, ymm1, xmm0, 1
>
> at some point. I don't think we want such detailed code-gen tests here. A
> test like this belongs in the gcc/testsuite.
>
> ---
>
> cat(a, b, c, d) where each argument is e.g. 2 elements wide, would fold
> from the left before this change: (2, 2, 2, 2) -> (4, 2, 2) -> (6, 2) ->
> (8). It is better for ILP (and to avoid load and store instructions) to
> go via (4, 2, 2) -> (4, 4) -> (8).
>
> In theory, for an even larger number of arguments, the current
> implementation still isn't good enough. But passing that many arguments
> is something users shouldn't be doing anyway.
>
> Signed-off-by: Matthias Kretz <[email protected]>
>
> libstdc++-v3/ChangeLog:
>
>         * include/bits/vec_ops.h (__vec_concat_sized): Add an overload
>         that concatenates the second and third operand, if they are
>         smaller than the first.
> ---
>  libstdc++-v3/include/bits/vec_ops.h | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
>
> diff --git a/libstdc++-v3/include/bits/vec_ops.h b/libstdc++-v3/include/bits/vec_ops.h
> index 0e89c89b7af5..e5bf2f1497cd 100644
> --- a/libstdc++-v3/include/bits/vec_ops.h
> +++ b/libstdc++-v3/include/bits/vec_ops.h
> @@ -187,7 +187,30 @@ __vec_concat(_TV __a, _TV __b)
>     * with the elements from applying this function recursively to @p __rest.
>     *
>     * @pre _N0 <= __width_of<_TV0> && _N1 <= __width_of<_TV1> && _Ns <= __width_of<_TVs> && ...
> +   *
> +   * Strategy: Aim for a power-of-2 tree concat. E.g.
> +   * - cat(2, 2, 2, 2) -> cat(4, 2, 2) -> cat(4, 4)
> +   * - cat(2, 2, 2, 2, 8) -> cat(4, 2, 2, 8) -> cat(4, 4, 8) -> cat(8, 8)
>     */
> +  template <int _N0, int _N1, int... _Ns, __vec_builtin _TV0, __vec_builtin _TV1,
> +          __vec_builtin... _TVs>
> +    [[__gnu__::__always_inline__]]
> +    constexpr __vec_builtin_type<__vec_value_type<_TV0>,
> +                                __bit_ceil(unsigned(_N0 + (_N1 + ... + _Ns)))>
> +    __vec_concat_sized(const _TV0& __a, const _TV1& __b, const _TVs&... __rest);
> +
> +  template <int _N0, int _N1, int _N2, int... _Ns, __vec_builtin _TV0, __vec_builtin _TV1,
> +           __vec_builtin _TV2, __vec_builtin... _TVs>
> +    requires (__has_single_bit(unsigned(_N0))) && (_N0 >= (_N1 + _N2))
> +    [[__gnu__::__always_inline__]]
> +    constexpr __vec_builtin_type<__vec_value_type<_TV0>,
> +                                __bit_ceil(unsigned(_N0 + _N1 + (_N2 + ... + _Ns)))>
> +    __vec_concat_sized(const _TV0& __a, const _TV1& __b, const _TV2& __c, const _TVs&... __rest)
>
I do not think that this should be a separate overload, but rather another
if constexpr branch in the default implementation.

> +    {
> +      return __vec_concat_sized<_N0, _N1 + _N2, _Ns...>(
> +              __a, __vec_concat_sized<_N1, _N2>(__b, __c), __rest...);
> +    }
> +
>    template <int _N0, int _N1, int... _Ns, __vec_builtin _TV0, __vec_builtin _TV1,
>            __vec_builtin... _TVs>
>      [[__gnu__::__always_inline__]]
> --
> ──────────────────────────────────────────────────────────────────────────
>  Dr. Matthias Kretz                           https://mattkretz.github.io
>  GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
>  std::simd
> ──────────────────────────────────────────────────────────────────────────
>
>
