https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880
Bug ID: 125880
Summary: byte and word memory move to %xmm should use pinsr{b,w}
Product: gcc
Version: 17.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---
Target: x86_64-*-*
void foo (short * __restrict p, short *q, int s, int n)
{
  for (int i = 0; i < n; ++i)
    {
      p[i] = q[s*i];
    }
}
generates with -fno-vect-cost-model -msse4:
movzwl (%rax,%r12), %ecx
movd %ecx, %xmm0
pinsrw $1, (%rax), %xmm0
...
but I think it would be better to generate
xorps %xmm0, %xmm0
pinsrw $0, (%rax,%r12), %xmm0
pinsrw $1, (%rax), %xmm0
because that a) avoids the GPR and b) on AMD uarchs avoids one uop and a
cross register file transfer. The GPR path contends for different ports
there, though.
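To make the two shapes concrete, here is a minimal intrinsics sketch of both sequences for the first two elements. The function names pair_via_gpr and pair_via_pinsrw are hypothetical, and the instruction comments describe the intended mapping only; which instructions the compiler actually selects for either function is exactly what this report is about.

```c
#include <emmintrin.h>  /* SSE2: _mm_insert_epi16, _mm_cvtsi32_si128 */
#include <stdint.h>

/* Current shape: zero-extend element 0 in a GPR, move it to the vector
   register, then insert element 1 from memory.  */
static inline __m128i
pair_via_gpr (const short *a, const short *b)
{
  __m128i v = _mm_cvtsi32_si128 ((uint16_t) *a);  /* movzwl + movd */
  return _mm_insert_epi16 (v, *b, 1);             /* pinsrw $1 */
}

/* Proposed shape: clear the register, then insert both elements.  */
static inline __m128i
pair_via_pinsrw (const short *a, const short *b)
{
  __m128i v = _mm_setzero_si128 ();               /* xorps/pxor */
  v = _mm_insert_epi16 (v, *a, 0);                /* pinsrw $0 */
  return _mm_insert_epi16 (v, *b, 1);             /* pinsrw $1 */
}
```

Both functions produce the same vector contents; only the way element 0 reaches the register differs.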
A similar testcase with bytes,
void foo (char * __restrict p, char *q, int s, int n)
{
  for (int i = 0; i < n; ++i)
    {
      p[2*i] = q[s*2*i];
      p[2*i + 1] = q[s*(2*i + 1)];
    }
}
gets
movzbl (%rdx,%rdi), %ecx
movd %ecx, %xmm0
pinsrb $1, (%rax,%rdi), %xmm0
There's also the possibility of merging up to HI/SI/DImode in GPRs, using
movz{b,w}l loads to GPRs and shift/ior (possibly when pinsr{b,w} is not
available).
For the cases above the code comes from the vec_init expander, but I can
imagine this might be too early for a perfect decision.
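The GPR merging alternative (zero-extending loads combined with shift/ior) can be sketched in plain C as follows. The helper names merge_hi_to_si and merge_qi_to_hi are hypothetical, chosen after the machine modes involved; this only illustrates the data flow, not what the expander would emit.

```c
#include <stdint.h>

/* Merge two 16-bit elements into one 32-bit (SImode) value using only
   GPR operations: two zero-extending loads, a shift and an ior.  */
static inline uint32_t
merge_hi_to_si (const uint16_t *lo, const uint16_t *hi)
{
  uint32_t l = *lo;        /* movzwl */
  uint32_t h = *hi;        /* movzwl */
  return l | (h << 16);    /* shift + ior */
}

/* Same idea one step down: two bytes merged into a 16-bit (HImode) value.  */
static inline uint16_t
merge_qi_to_hi (const uint8_t *lo, const uint8_t *hi)
{
  uint32_t l = *lo;        /* movzbl */
  uint32_t h = *hi;        /* movzbl */
  return (uint16_t) (l | (h << 8));
}
```

The merged value could then be moved to the vector register with a single movd (or movq for DImode), at the cost of keeping the GPR dependency.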