On Mon, Oct 20, 2025 at 5:59 PM Roger Sayle <[email protected]> wrote:
>
>
> Hi Uros and H.J.,
> Here's an old patch that I never got around to posting due to stage
> restrictions last year (or the year before).
>
> Currently x86_64's TImode STV pass has the restriction that candidate
> chains must start with a TImode load from memory.  This patch improves
> the functionality of STV to allow zero-extensions and construction of
> TImode pseudos from two DImode values (i.e. *concatditi) to both be
> considered candidate chain initiators.  For example, this allows chains
> starting from an __int128 function argument to be processed by STV.
>
> Compiled with -O2 on x86_64:
>
> __int128 m0,m1,m2,m3;
> void foo(__int128 m)
> {
>     m0 = m;
>     m1 = m;
>     m2 = m;
>     m3 = m;
> }
>
> Previously generated:
>
> foo:    xchgq   %rdi, %rsi
>         movq    %rsi, m0(%rip)
>         movq    %rdi, m0+8(%rip)
>         movq    %rsi, m1(%rip)
>         movq    %rdi, m1+8(%rip)
>         movq    %rsi, m2(%rip)
>         movq    %rdi, m2+8(%rip)
>         movq    %rsi, m3(%rip)
>         movq    %rdi, m3+8(%rip)
>         ret
>
> With the patch, we now generate:
>
> foo:    movq    %rdi, %xmm0
>         movq    %rsi, %xmm1
>         punpcklqdq      %xmm1, %xmm0
>         movaps  %xmm0, m0(%rip)
>         movaps  %xmm0, m1(%rip)
>         movaps  %xmm0, m2(%rip)
>         movaps  %xmm0, m3(%rip)
>         ret
>
> or with -mavx2:
>
> foo:    vmovq   %rdi, %xmm1
>         vpinsrq $1, %rsi, %xmm1, %xmm0
>         vmovdqa %xmm0, m0(%rip)
>         vmovdqa %xmm0, m1(%rip)
>         vmovdqa %xmm0, m2(%rip)
>         vmovdqa %xmm0, m3(%rip)
>         ret
>
> Likewise, for zero-extension:
>
> __int128 m0,m1,m2,m3;
> void bar(unsigned long x)
> {
>     __int128 m = x;
>     m0 = m;
>     m1 = m;
>     m2 = m;
>     m3 = m;
> }
>
> Previously with -O2:
>
> bar:    movq    %rdi, m0(%rip)
>         movq    $0, m0+8(%rip)
>         movq    %rdi, m1(%rip)
>         movq    $0, m1+8(%rip)
>         movq    %rdi, m2(%rip)
>         movq    $0, m2+8(%rip)
>         movq    %rdi, m3(%rip)
>         movq    $0, m3+8(%rip)
>         ret
>
> with this patch:
>
> bar:    movq    %rdi, %xmm0
>         movaps  %xmm0, m0(%rip)
>         movaps  %xmm0, m1(%rip)
>         movaps  %xmm0, m2(%rip)
>         movaps  %xmm0, m3(%rip)
>         ret
>
>
> As shown in the examples above, the scalar-to-vector (STV) conversion of
> *concatditi has an overhead, because the two DImode values must first be
> moved into a vector register [whereas treating two DImode registers as a
> TImode value is free on x86_64].  Specifying this penalty allows the STV
> pass to make an informed decision whether the total cost/gain of the
> chain is a net win.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2025-10-20  Roger Sayle  <[email protected]>
>
> gcc/ChangeLog
>         * config/i386/i386-features.cc (timode_concatdi_p): New
>         function to recognize the various variants of *concatditi3_[1-7].
>         (scalar_chain::add_insn): Like VEC_SELECT, ZERO_EXTEND and
>         timode_concatdi_p instructions don't require their input
>         operands to be converted (to TImode).
>         (timode_scalar_chain::compute_convert_gain): Split/clone XOR and
>         IOR cases from AND case, to handle timode_concatdi_p costs.
>         <case PLUS>: Handle timode_concatdi_p conversion costs.
>         <case ZERO_EXTEND>: Provide costs of DImode to TImode extension.
>         (timode_convert_concatdi): Helper function to transform a
>         *concatditi3 instruction into a vec_concatv2di instruction.
>         (timode_scalar_chain::convert_insn): Split/clone XOR and IOR
>         cases from AND case, to handle timode_concatdi_p using the new
>         timode_convert_concatdi helper function.
>         <case ZERO_EXTEND>: Convert zero_extendditi2 to *vec_concatv2di_0.
>         <case PLUS>: Handle timode_concatdi_p using the new
>         timode_convert_concatdi helper function.
>         (timode_scalar_to_vector_candidate_p): Support timode_concatdi_p
>         instructions in IOR, XOR and PLUS cases.
>         <case ZERO_EXTEND>: Consider zero extension of a register from
>         DImode to TImode to be a candidate.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/sse4_1-stv-10.c: New test case.
>         * gcc.target/i386/sse4_1-stv-11.c: Likewise.
>         * gcc.target/i386/sse4_1-stv-12.c: Likewise.

I didn't check gains in compute_convert_gain in detail, but they look
reasonable, and you have much more experience here.
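
For what it's worth, the trade-off being weighed can be sketched as a toy
model.  This is only an illustration: the struct, the function names and
the cost numbers used with it are hypothetical placeholders, not the
values or the code in compute_convert_gain.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical accounting for one candidate chain.  None of these
   fields or numbers come from the patch; they only model the idea
   that a one-off setup penalty is weighed against per-insn savings.  */
struct chain_cost {
    int n_stores;           /* TImode stores in the chain */
    int scalar_store_cost;  /* one store done as two DImode moves */
    int vector_store_cost;  /* one store done as a single vector move */
    int init_cost;          /* one-off cost of building the vector
                               register from two DImode values */
};

/* Net gain: per-store savings summed over the chain, minus the
   one-off penalty for materializing the vector value.  */
int chain_gain(const struct chain_cost *c)
{
    return c->n_stores * (c->scalar_store_cost - c->vector_store_cost)
           - c->init_cost;
}

/* In this model a chain is converted only when the net gain is positive.  */
bool chain_profitable(const struct chain_cost *c)
{
    return chain_gain(c) > 0;
}
```

With these made-up costs, a chain with four stores can amortize the
setup penalty while a chain with a single store cannot, which is the
kind of decision the penalty lets the pass make.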

As shown by the attached testcases, this functionality is a nice addition
to the STV pass.
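
As a further illustration of the IOR case mentioned in the ChangeLog, here
is a minimal sketch of source code that builds a 128-bit value from two
64-bit halves.  The function names baz and baz_selfcheck are made up for
this example, and it is an assumption (not verified here) that the
expression is matched by one of the *concatditi3 patterns; the code
actually generated depends on the compiler version and options.

```c
#include <assert.h>
#include <stdint.h>

/* Globals follow the m0..m3 pattern of the examples in the patch
   description, but use an unsigned type so the shift is well defined.  */
unsigned __int128 m0, m1, m2, m3;

/* Concatenate two 64-bit values into one 128-bit value using
   shift + IOR, then store it four times.  */
void baz(uint64_t hi, uint64_t lo)
{
    unsigned __int128 m = ((unsigned __int128)hi << 64) | lo;
    m0 = m;
    m1 = m;
    m2 = m;
    m3 = m;
}

/* Runtime check that both halves end up where expected, regardless
   of whether the stores are done in scalar or vector registers.  */
void baz_selfcheck(void)
{
    baz(0x0123456789abcdefULL, 0xfedcba9876543210ULL);
    assert((uint64_t)(m0 >> 64) == 0x0123456789abcdefULL);
    assert((uint64_t)m0 == 0xfedcba9876543210ULL);
    assert(m1 == m0 && m2 == m0 && m3 == m0);
}
```

Compiling this with -O2 and inspecting the assembly should show whether
the chain was converted on a given compiler.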

The patch is OK.

Thanks,
Uros.
