https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87718
--- Comment #4 from Terry Guo <xuepeng.guo at intel dot com> ---
(In reply to Uroš Bizjak from comment #2)
> Following testcase:
>
> --cut here--
> typedef int V __attribute__((vector_size (8)));
>
> void foo (int x, int y)
> {
> register int a __asm ("xmm1");
> register int b __asm ("xmm2");
> register V c __asm ("xmm3");
> a = x;
> b = y;
> asm volatile ("" : "+v" (a), "+v" (b));
> c = (V) { a, b };
> asm volatile ("" : "+v" (c));
> }
> --cut here--
>
> gets compiled with -O2 -mavx -mtune=intel:
>
> vmovd %edi, %xmm1
> vmovd %esi, %xmm2
> vmovd %xmm2, %eax
> vpinsrd $1, %eax, %xmm1, %xmm3
> ret
>
> The relevant pattern is defined as:
>
> (define_insn "*vec_concatv2si_sse4_1"
> [(set (match_operand:V2SI 0 "register_operand"
> "=Yr,*x, x, v,Yr,*x, v, v, *y,*y")
> (vec_concat:V2SI
> (match_operand:SI 1 "nonimmediate_operand"
> " 0, 0, x,Yv, 0, 0,Yv,rm, 0,rm")
> (match_operand:SI 2 "nonimm_or_0_operand"
> " rm,rm,rm,rm,Yr,*x,Yv, C,*ym, C")))]
> "TARGET_SSE4_1 && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
> "@
> pinsrd\t{$1, %2, %0|%0, %2, 1}
> pinsrd\t{$1, %2, %0|%0, %2, 1}
> vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1}
> vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1}
> punpckldq\t{%2, %0|%0, %2}
> punpckldq\t{%2, %0|%0, %2}
> vpunpckldq\t{%2, %1, %0|%0, %1, %2}
> %vmovd\t{%1, %0|%0, %1}
> punpckldq\t{%2, %0|%0, %2}
> movd\t{%1, %0|%0, %1}"
>
> but for some reason RA chooses alternative 2 (x<-x,rm) instead of
> alternative 6 (v<-Yv,Yv), although alternative 2 needs an extra reload from
> %xmm2 to %eax.
I dug into this a bit, and it looks like we miss something in the combine
pass and hence fail to get a pattern that can match alternative 6. The combine
pass dump of an older GCC shows:
-------------------
REG_UNUSED flags:CC
insn_cost 4 for 10: r82:SI=xmm16:SI
REG_DEAD xmm16:SI
insn_cost 4 for 11: r83:SI=xmm17:SI
REG_DEAD xmm17:SI
insn_cost 4 for 12: r87:V2SI=vec_concat(r82:SI,r83:SI)
REG_DEAD r83:SI
REG_DEAD r82:SI
-------------------
Then we get:
-------------------
Trying 10 -> 12:
10: r82:SI=xmm16:SI
REG_DEAD xmm16:SI
12: r87:V2SI=vec_concat(r82:SI,r83:SI)
REG_DEAD r83:SI
REG_DEAD r82:SI
Successfully matched this instruction:
(set (reg:V2SI 87)
(vec_concat:V2SI (reg/v:SI 52 xmm16 [ a ])
(reg:SI 83 [ b.1_2 ])))
allowing combination of insns 10 and 12
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 10.
modifying insn i3 12: r87:V2SI=vec_concat(xmm16:SI,r83:SI)
REG_DEAD xmm16:SI
REG_DEAD r83:SI
deferring rescan insn with uid = 12.
Trying 11 -> 12:
11: r83:SI=xmm17:SI
REG_DEAD xmm17:SI
12: r87:V2SI=vec_concat(xmm16:SI,r83:SI)
REG_DEAD xmm16:SI
REG_DEAD r83:SI
Successfully matched this instruction:
(set (reg:V2SI 87)
(vec_concat:V2SI (reg/v:SI 52 xmm16 [ a ])
(reg/v:SI 53 xmm17 [ b ])))
allowing combination of insns 11 and 12
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 11.
modifying insn i3 12: r87:V2SI=vec_concat(xmm16:SI,xmm17:SI)
REG_DEAD xmm17:SI
REG_DEAD xmm16:SI
deferring rescan insn with uid = 12.
-------------------
There are two successful combine attempts, and we end up with a pattern that
can match alternative 6.
However, the dump from current GCC trunk shows:
-------------------
insn_cost 4 for 19: r90:SI=xmm16:SI
REG_DEAD xmm16:SI
insn_cost 4 for 10: r82:SI=r90:SI
REG_DEAD r90:SI
insn_cost 4 for 20: r91:SI=xmm17:SI
REG_DEAD xmm17:SI
insn_cost 4 for 11: r83:SI=r91:SI
REG_DEAD r91:SI
insn_cost 4 for 12: r87:V2SI=vec_concat(r82:SI,r83:SI)
REG_DEAD r83:SI
REG_DEAD r82:SI
insn_cost 4 for 13: xmm3:V2SI=r87:V2SI
REG_DEAD r87:V2SI
-------------------
Trying 11 -> 12:
11: r83:SI=r91:SI
REG_DEAD r91:SI
12: r87:V2SI=vec_concat(r90:SI,r83:SI)
REG_DEAD r90:SI
REG_DEAD r83:SI
Successfully matched this instruction:
(set (reg:V2SI 87)
(vec_concat:V2SI (reg:SI 90)
(reg:SI 91)))
allowing combination of insns 11 and 12
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 11.
modifying insn i3 12: r87:V2SI=vec_concat(r90:SI,r91:SI)
REG_DEAD r91:SI
REG_DEAD r90:SI
deferring rescan insn with uid = 12.
-------------------
We end up with "12: r87:V2SI=vec_concat(r90:SI,r91:SI)". Later, in the LRA
pass, operand r90 is assigned an XMM register while r91 is kept in a general
register, so there is no chance to match the preferred alternative 6, which
needs both inputs in SSE registers (Yv,Yv).
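
To illustrate why the register class of the second input matters, here is a
rough Python model of the constraint table quoted above. It is only a sketch:
the constraint strings are copied from the pattern, but the mapping of
constraint letters to operand kinds ("xmm", "gpr", "mem", "mmx", "zero") and
the function names are my own simplification. It ignores the '*' and '='
modifiers, any per-alternative ISA enabling, the difference between
x/v/Yv/Yr, and the fact that "0" really requires the same register as
operand 0, not just the same kind.

```python
# Simplified model of the operand constraints of "*vec_concatv2si_sse4_1".
# The letter-to-kind mapping below is an assumption of this sketch, not
# how GCC's register allocator actually evaluates constraints.

OP0 = "=Yr,*x, x, v,Yr,*x, v, v, *y,*y"
OP1 = " 0, 0, x,Yv, 0, 0,Yv,rm, 0,rm"
OP2 = " rm,rm,rm,rm,Yr,*x,Yv, C,*ym, C"

LETTERS = [
    ("Yr", "xmm"), ("Yv", "xmm"), ("x", "xmm"), ("v", "xmm"),
    ("r", "gpr"), ("m", "mem"), ("y", "mmx"), ("C", "zero"),
]


def kinds(constraint, op0_kind=None):
    """Return the set of operand kinds accepted by one constraint string."""
    text = constraint.strip().lstrip("=").replace("*", "")
    accepted = set()
    while text:
        if text[0] == "0":
            # Matching constraint: modelled as "same kind as operand 0".
            accepted.add(op0_kind)
            text = text[1:]
            continue
        for letters, kind in LETTERS:
            if text.startswith(letters):
                accepted.add(kind)
                text = text[len(letters):]
                break
        else:
            raise ValueError(f"unknown constraint: {text!r}")
    return accepted


def matching_alternatives(op0, op1, op2):
    """List the alternatives (0-based) that accept the given operand kinds."""
    result = []
    rows = zip(OP0.split(","), OP1.split(","), OP2.split(","))
    for index, (c0, c1, c2) in enumerate(rows):
        if (op0 in kinds(c0)
                and op1 in kinds(c1, op0_kind=op0)
                and op2 in kinds(c2)):
            result.append(index)
    return result


# Both inputs in SSE registers, as in the older GCC dump.
print(matching_alternatives("xmm", "xmm", "xmm"))  # [4, 5, 6]
# Second input left in a general register, as on trunk after LRA.
print(matching_alternatives("xmm", "xmm", "gpr"))  # [0, 1, 2, 3]
```

In this model, alternative 6 is only reachable when both inputs are SSE
registers. Once the second input sits in a general register, only the
pinsrd/vpinsrd alternatives (0 to 3) remain, which is consistent with the
vpinsrd seen in the generated code.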