https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125880

            Bug ID: 125880
           Summary: byte and word memory move to %xmm should use
                    pinsr{b,w}
           Product: gcc
           Version: 17.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-*

void foo (short * __restrict p, short *q, int s, int n)
{
  for (int i = 0; i < n; ++i)
    {
      p[i] = q[s*i];
    }
}

generates with -fno-vect-cost-model -msse4:

        movzwl  (%rax,%r12), %ecx
        movd    %ecx, %xmm0
        pinsrw  $1, (%rax), %xmm0
...

but I think it would be better to generate

        xorps %xmm0, %xmm0
        pinsrw  $0, (%rax,%r12), %xmm0
        pinsrw  $1, (%rax), %xmm0

because that a) avoids the GPR and b) on AMD uarchs saves one uop and a
cross-register-file transfer.  The GPR path does contend for different ports
there, though.
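
For illustration, the proposed sequence corresponds roughly to the intrinsics
sketch below.  The helper name gather2_epi16 is hypothetical and not part of
the testcase, and the compiler remains free to pick a different instruction
sequence for these intrinsics, so this only shows the intended data flow, not
guaranteed code generation.

```c
#include <emmintrin.h>  /* SSE2: _mm_setzero_si128, _mm_insert_epi16 */

/* Hypothetical helper: place q[0] and q[s] into 16-bit lanes 0 and 1 of
   an otherwise zero vector, without an explicit scalar-to-vector move.  */
static inline __m128i
gather2_epi16 (const short *q, int s)
{
  __m128i v = _mm_setzero_si128 ();     /* zeroing idiom (xorps/pxor) */
  v = _mm_insert_epi16 (v, q[0], 0);    /* candidate for pinsrw $0, mem */
  v = _mm_insert_epi16 (v, q[s], 1);    /* candidate for pinsrw $1, mem */
  return v;
}
```

Starting from a zeroed register means the first insert does not depend on the
previous contents of the destination.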

A similar testcase

void foo (char * __restrict p, char *q, int s, int n)
{
  for (int i = 0; i < n; ++i)
    {
      p[2*i] = q[s*2*i];
      p[2*i + 1] = q[s*(2*i + 1)];
    }
}

gets

        movzbl  (%rdx,%rdi), %ecx
        movd    %ecx, %xmm0
        pinsrb  $1, (%rax,%rdi), %xmm0

There is also the possibility of merging up to HI/SI/DImode in GPRs, using
movz{b,w}l loads and shift/ior (possibly when pinsr{b,w} is not available).
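
A minimal sketch of that GPR merge, written as plain C.  The helper names
merge2_u8 and merge2_u16 are hypothetical, and the lane order assumes the
little-endian layout of x86.

```c
#include <stdint.h>

/* Hypothetical helper: merge two strided bytes into one HImode value.  */
static inline uint16_t
merge2_u8 (const unsigned char *q, int s)
{
  uint32_t lo = q[0];                   /* zero-extending load (movzbl) */
  uint32_t hi = q[s];                   /* zero-extending load (movzbl) */
  return (uint16_t) (lo | (hi << 8));   /* shift + ior */
}

/* Hypothetical helper: merge two strided 16-bit words into one SImode
   value.  */
static inline uint32_t
merge2_u16 (const uint16_t *q, int s)
{
  uint32_t lo = q[0];                   /* zero-extending load (movzwl) */
  uint32_t hi = q[s];                   /* zero-extending load (movzwl) */
  return lo | (hi << 16);               /* shift + ior */
}
```

The merged value could then be moved to the vector register with a single
movd, or stored directly if no further vector work is needed.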

For the cases above the code comes from the vec_init expander, but I can
imagine that this is too early to make a perfect decision.
