https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104306
Bug ID: 104306
Summary: Use secondary_reload for optimized interunit reg-reg moves
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ubizjak at gmail dot com
Target Milestone: ---

Created attachment 52318
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52318&action=edit
Prototype patch

I was looking at how to avoid the STLF (store-to-load forwarding) failure
in the following test:

__uint128_t __attribute__((noinline))
foo (__uint128_t a)
{
  __uint128_t r = a;
  __asm ("" : "+x" (r));
  return r;
}

Currently, the above test compiles (-O2, without -msse4) to:

foo:
        movq    %rdi, -24(%rsp)
        movq    %rsi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0
        movaps  %xmm0, -24(%rsp)
        movq    -24(%rsp), %rax
        movq    -16(%rsp), %rdx
        ret

This is somewhat suboptimal and fails store-to-load forwarding.

I tried to exercise the secondary reload functionality (see the attached
patch), which introduces specialized patterns to perform the move:

+(define_expand "reload_movti_from_xmm"
+  [(parallel [(match_operand 0 "register_operand" "=r")
+              (match_operand 1 "register_operand" "x")
+              (match_operand:TI 2 "register_operand" "=&x")])]
+  "TARGET_64BIT && TARGET_SSE2"
+{
+  emit_move_insn (gen_lowpart (DImode, operands[0]),
+                  gen_lowpart (DImode, operands[1]));
+
+  emit_insn (gen_sse2_pshufd
+             (gen_lowpart (V4SImode, operands[2]),
+              gen_lowpart (V4SImode, operands[1]),
+              GEN_INT (78)));
+
+  emit_move_insn (gen_highpart (DImode, operands[0]),
+                  gen_lowpart (DImode, operands[2]));
+  DONE;
+})
+
+(define_expand "reload_movti_to_xmm"
+  [(parallel [(match_operand 0 "register_operand" "=x")
+              (match_operand 1 "register_operand" "r")
+              (match_operand:TI 2 "register_operand" "=&x")])]
+  "TARGET_64BIT && TARGET_SSE2"
+{
+  rtx hi = gen_highpart (DImode, operands[1]);
+  rtx lo = gen_lowpart (DImode, operands[1]);
+
+  emit_move_insn (gen_lowpart (DImode, operands[0]), lo);
+  emit_move_insn (gen_lowpart (DImode, operands[2]), hi);
+
+  emit_insn (gen_vec_interleave_lowv2di
+             (gen_lowpart (V2DImode, operands[0]),
+              gen_lowpart (V2DImode, operands[0]),
+              gen_lowpart (V2DImode, operands[2])));
+  DONE;
+})

However, the solution only semi-worked and produced the following code:

foo:
        movq    %rdi, -24(%rsp)
        movq    -24(%rsp), %xmm0
        movq    %rsi, -16(%rsp)
        movq    -16(%rsp), %xmm1
        punpcklqdq      %xmm1, %xmm0
        movaps  %xmm0, -24(%rsp)
        pshufd  $78, -24(%rsp), %xmm2
        movq    -24(%rsp), %rax
        movq    %xmm2, %rdx
        ret

I am probably abusing the above functionality by adding the following to
ix86_secondary_reload:

+  if (mode == TImode
+      && TARGET_64BIT && !TARGET_SSE4_1
+      && in_p && REG_P (x))
+    {
+      if (INTEGER_CLASS_P (rclass))
+	  // && SSE_REGNO_P (true_regnum (x)))
+	{
+	  sri->icode = CODE_FOR_reload_movti_from_xmm;
+	  sri->extra_cost = 1;
+	  return NO_REGS;
+	}
+      else if (SSE_CLASS_P (rclass))
+	  // && GENERAL_REGNO_P (true_regnum (x)))
+	{
+	  sri->icode = CODE_FOR_reload_movti_to_xmm;
+	  sri->extra_cost = 1;
+	  return NO_REGS;
+	}
+    }

I'd expect the RA to allocate an SSE register and *then* call
ix86_secondary_reload.  Currently, true_regnum (x) always returns -1,
indicating that the operand is in memory, and indeed the asm above hints
that this is the case.

The reload starts with:

(insn 26 7 11 2 (set (reg:TI 87 [ r ])
        (reg/v:TI 83 [ a ])) "ti.c":6:3 79 {*movti_internal}
     (expr_list:REG_DEAD (reg/v:TI 83 [ a ])
        (nil)))

and then allocates (reg 94) of class NO_REGS:

      Creating newreg=94, assigning class NO_REGS to secondary r94
   26: r94:TI=r83:TI
      REG_DEAD r83:TI
    Inserting the sec. move after:
   28: r87:TI=r94:TI
      Creating newreg=95, assigning class SSE_REGS to scratch r95
   28: r87:TI=r94:TI
    Inserting the move before:
   29: r87:TI#0=r94:TI#0
   30: r95:TI#0=r94:TI#8
   31: r87:TI#0=vec_select(vec_concat(r87:TI#0,r95:TI#0),parallel)

If the allocator passed the regnum of the pseudo, then the memory access
could be avoided.
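For reference, the memory-free instruction sequences the two reload patterns aim to emit can be sketched with SSE2 intrinsics (this sketch is not part of the bug report; the function names are illustrative, and it assumes an x86-64 target where SSE2 is baseline):

/* GPR -> XMM without touching memory: two movq plus punpcklqdq,
   as in reload_movti_to_xmm.  */
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

static __m128i
ti_to_xmm (uint64_t lo, uint64_t hi)
{
  __m128i l = _mm_cvtsi64_si128 ((long long) lo);  /* movq */
  __m128i h = _mm_cvtsi64_si128 ((long long) hi);  /* movq */
  return _mm_unpacklo_epi64 (l, h);                /* punpcklqdq */
}

/* XMM -> GPR without touching memory: movq, pshufd $78 (swap the
   64-bit halves; 78 == 0b01001110), movq, as in reload_movti_from_xmm.  */
static void
xmm_to_ti (__m128i v, uint64_t *lo, uint64_t *hi)
{
  *lo = (uint64_t) _mm_cvtsi128_si64 (v);          /* movq */
  __m128i swapped = _mm_shuffle_epi32 (v, 78);     /* pshufd $78 */
  *hi = (uint64_t) _mm_cvtsi128_si64 (swapped);    /* movq */
}

int
main (void)
{
  uint64_t lo, hi;
  /* Round-trip a 128-bit value through an XMM register.  */
  xmm_to_ti (ti_to_xmm (0x1122334455667788ull, 0x99aabbccddeeff00ull),
	     &lo, &hi);
  printf ("lo=%llx hi=%llx\n",
	  (unsigned long long) lo, (unsigned long long) hi);
  return 0;
}

Compiled at -O2, ti_to_xmm and xmm_to_ti reduce to exactly the movq/punpcklqdq and movq/pshufd/movq sequences above, with no stack spill.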