https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104306
Bug ID: 104306
Summary: Use secondary_reload for optimized interunit reg-reg moves
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ubizjak at gmail dot com
Target Milestone: ---

Created attachment 52318
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52318&action=edit
Prototype patch

I was looking at how to avoid the STLF (store-to-load forwarding) failure
in the following test:

__uint128_t __attribute__((noinline))
foo (__uint128_t a)
{
  __uint128_t r = a;
  __asm ("" : "+x" (r));
  return r;
}

Currently, the above test compiles (-O2, without -msse4) to:

foo:
        movq    %rdi, -24(%rsp)
        movq    %rsi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0
        movaps  %xmm0, -24(%rsp)
        movq    -24(%rsp), %rax
        movq    -16(%rsp), %rdx
        ret

This is somewhat suboptimal and fails store-to-load forwarding.

I tried to exercise the secondary reload functionality (see the attached
patch), which introduces specialized patterns to perform the move:

+(define_expand "reload_movti_from_xmm"
+  [(parallel [(match_operand 0 "register_operand" "=r")
+              (match_operand 1 "register_operand" "x")
+              (match_operand:TI 2 "register_operand" "=&x")])]
+  "TARGET_64BIT && TARGET_SSE2"
+{
+  emit_move_insn (gen_lowpart (DImode, operands[0]),
+                  gen_lowpart (DImode, operands[1]));
+
+  emit_insn (gen_sse2_pshufd
+             (gen_lowpart (V4SImode, operands[2]),
+              gen_lowpart (V4SImode, operands[1]),
+              GEN_INT (78)));
+
+  emit_move_insn (gen_highpart (DImode, operands[0]),
+                  gen_lowpart (DImode, operands[2]));
+  DONE;
+})
+
+(define_expand "reload_movti_to_xmm"
+  [(parallel [(match_operand 0 "register_operand" "=x")
+              (match_operand 1 "register_operand" "r")
+              (match_operand:TI 2 "register_operand" "=&x")])]
+  "TARGET_64BIT && TARGET_SSE2"
+{
+  rtx hi = gen_highpart (DImode, operands[1]);
+  rtx lo = gen_lowpart (DImode, operands[1]);
+
+  emit_move_insn (gen_lowpart (DImode, operands[0]), lo);
+  emit_move_insn (gen_lowpart (DImode, operands[2]), hi);
+
+  emit_insn (gen_vec_interleave_lowv2di
+             (gen_lowpart (V2DImode, operands[0]),
+              gen_lowpart (V2DImode, operands[0]),
+              gen_lowpart (V2DImode, operands[2])));
+  DONE;
+})

However, the solution only semi-worked and produced the following code:

foo:
        movq    %rdi, -24(%rsp)
        movq    -24(%rsp), %xmm0
        movq    %rsi, -16(%rsp)
        movq    -16(%rsp), %xmm1
        punpcklqdq      %xmm1, %xmm0
        movaps  %xmm0, -24(%rsp)
        pshufd  $78, -24(%rsp), %xmm2
        movq    -24(%rsp), %rax
        movq    %xmm2, %rdx
        ret

I am probably abusing the above functionality by adding the following to
ix86_secondary_reload:

+  if (mode == TImode
+      && TARGET_64BIT && !TARGET_SSE4_1
+      && in_p && REG_P (x))
+    {
+      if (INTEGER_CLASS_P (rclass))
+	  // && SSE_REGNO_P (true_regnum (x)))
+	{
+	  sri->icode = CODE_FOR_reload_movti_from_xmm;
+	  sri->extra_cost = 1;
+	  return NO_REGS;
+	}
+      else if (SSE_CLASS_P (rclass))
+	  // && GENERAL_REGNO_P (true_regnum (x)))
+	{
+	  sri->icode = CODE_FOR_reload_movti_to_xmm;
+	  sri->extra_cost = 1;
+	  return NO_REGS;
+	}
+    }

I'd expect the RA to allocate an SSE register and *then* call
ix86_secondary_reload.  Currently, true_regnum (x) always returns -1,
indicating that the operand is in memory, and indeed the asm above hints
that this is the case.

The reload starts with:

(insn 26 7 11 2 (set (reg:TI 87 [ r ])
        (reg/v:TI 83 [ a ])) "ti.c":6:3 79 {*movti_internal}
     (expr_list:REG_DEAD (reg/v:TI 83 [ a ])
        (nil)))

and then allocates (reg 94) of class NO_REGS:

      Creating newreg=94, assigning class NO_REGS to secondary r94
   26: r94:TI=r83:TI
      REG_DEAD r83:TI
    Inserting the sec. move after:
   28: r87:TI=r94:TI
      Creating newreg=95, assigning class SSE_REGS to scratch r95
   28: r87:TI=r94:TI
    Inserting the move before:
   29: r87:TI#0=r94:TI#0
   30: r95:TI#0=r94:TI#8
   31: r87:TI#0=vec_select(vec_concat(r87:TI#0,r95:TI#0),parallel)

If the allocator passed the regnum of the pseudo, then the memory access
could be avoided.
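For reference, the memory-free instruction sequences the two reload patterns aim to emit can be sketched with SSE2 intrinsics (this sketch is not part of the bug report; the function names are illustrative, and it assumes an x86-64 target where SSE2 is baseline):

/* GPR -> XMM without touching memory: two movq plus punpcklqdq,
   as in reload_movti_to_xmm.  */
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

static __m128i
ti_to_xmm (uint64_t lo, uint64_t hi)
{
  __m128i l = _mm_cvtsi64_si128 ((long long) lo);  /* movq */
  __m128i h = _mm_cvtsi64_si128 ((long long) hi);  /* movq */
  return _mm_unpacklo_epi64 (l, h);                /* punpcklqdq */
}

/* XMM -> GPR without touching memory: movq, pshufd $78 (swap the
   64-bit halves; 78 == 0b01001110), movq, as in reload_movti_from_xmm.  */
static void
xmm_to_ti (__m128i v, uint64_t *lo, uint64_t *hi)
{
  *lo = (uint64_t) _mm_cvtsi128_si64 (v);          /* movq */
  __m128i swapped = _mm_shuffle_epi32 (v, 78);     /* pshufd $78 */
  *hi = (uint64_t) _mm_cvtsi128_si64 (swapped);    /* movq */
}

int
main (void)
{
  uint64_t lo, hi;
  /* Round-trip a 128-bit value through an XMM register.  */
  xmm_to_ti (ti_to_xmm (0x1122334455667788ull, 0x99aabbccddeeff00ull),
	     &lo, &hi);
  printf ("lo=%llx hi=%llx\n",
	  (unsigned long long) lo, (unsigned long long) hi);
  return 0;
}

Compiled at -O2, ti_to_xmm and xmm_to_ti reduce to exactly the movq/punpcklqdq and movq/pshufd/movq sequences above, with no stack spill.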