https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631

            Bug ID: 123631
           Summary: Odd choice for vector constant materialization
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

I'm seeing

void foo (int *q)
{
  q[0] = 10;
  q[1] = 10;
  q[2] = 10;
  q[3] = 10;
}

with -march=znver2

   0:   b8 0a 00 00 00          mov    $0xa,%eax
   5:   c5 f9 6e c0             vmovd  %eax,%xmm0
   9:   c4 e2 79 58 c0          vpbroadcastd %xmm0,%xmm0
   e:   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)

and -march=znver4

   0:   b8 0a 00 00 00          mov    $0xa,%eax
   5:   62 f2 7d 08 7c c0       vpbroadcastd %eax,%xmm0
   b:   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)

Both sequences are larger than what we get for a non-uniform vector
constant, which is loaded from memory:

   0:   c5 f9 6f 05 00 00 00    vmovdqa 0x0(%rip),%xmm0        # 8 <foo+0x8>
   7:   00 
   8:   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)

I think the load from memory also has comparable (if not lower) latency
when the constant is in cache, since it avoids the GPR<->XMM move, and it
for sure needs fewer uops and causes less port pressure.
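
For comparison, a non-uniform variant of the testcase could look like the
sketch below.  The function name and the stored values are hypothetical and
not taken from the original testcase; the point is only that the four
elements differ, so there is no single scalar to broadcast.  The exact code
generated will depend on the compiler version and flags.

```c
/* Hypothetical counterpart to foo (): the stored values differ, so the
   vector constant is not uniform and cannot be built by a broadcast of
   one scalar.  */
void foo_nonuniform (int *q)
{
  q[0] = 10;
  q[1] = 11;
  q[2] = 12;
  q[3] = 13;
}
```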

With FP we're broadcasting from scalar memory using vbroadcastss.  For
integer data of the same size that should be possible as well.  The
broadcast from memory is one byte larger than the full-vector load, but
possibly better for dcache, esp. when broadcasting to %ymm or %zmm.
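
A minimal sketch of the two related cases mentioned above, with
hypothetical function names: the FP counterpart of foo (), and a wider
integer variant where a uniform constant would fill a %ymm register when
256-bit vectors are used.  Which instructions are actually emitted depends
on the compiler version and on -march.

```c
/* Hypothetical FP counterpart of foo (): a uniform float constant.  */
void foo_float (float *q)
{
  q[0] = 10.0f;
  q[1] = 10.0f;
  q[2] = 10.0f;
  q[3] = 10.0f;
}

/* Hypothetical wider variant: eight uniform ints, i.e. 32 bytes, which
   is the size of a %ymm register.  */
void foo_wide (int *q)
{
  for (int i = 0; i < 8; i++)
    q[i] = 10;
}
```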
