[Bug target/109944] New: vector CTOR with byte elements and SSE2 has STLF fail

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 23 May 2023 06:48:32 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944


            Bug ID: 109944
           Summary: vector CTOR with byte elements and SSE2 has STLF fail
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

I've experimented with vector CTORs built from smaller elements, and byte
element handling with plain SSE2 is quite bad (for word elements we have
pinsrw).

void foo(char *a, char *m, char d, char e)
{
  char b = *m;
  char c = m[2];
  a[0] = b;
  a[1] = c;
  a[2] = d;
  a[3] = e;
  a[4] = b;
  a[5] = c;
  a[6] = b;
  a[7] = c;
  a[8] = b;
  a[9] = c;
  a[10] = b;
  a[11] = c;
  a[12] = b;
  a[13] = c;
  a[14] = b;
  a[15] = c;
}

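generates the following, which assembles all sixteen bytes in two GPRs,
spills them with two 8-byte stores and reloads them with a single 16-byte
load that cannot be store-to-load forwarded: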

        movzbl  2(%rsi), %r8d
        movl    %edx, %r9d
        movzbl  (%rsi), %edx
        movzbl  %cl, %ecx
        movzbl  %r9b, %r9d
        movq    %r8, %rax
        salq    $8, %rax
        orq     %rdx, %rax
        salq    $8, %rax
        orq     %r8, %rax
        salq    $8, %rax
        orq     %rdx, %rax
        salq    $8, %rax
        orq     %rax, %rcx
        orq     %r8, %rax
        salq    $8, %rcx
        salq    $8, %rax
        orq     %r9, %rcx
        orq     %rdx, %rax
        salq    $8, %rcx
        salq    $8, %rax
        orq     %r8, %rcx
        orq     %r8, %rax
        salq    $8, %rcx
        salq    $8, %rax
        orq     %rdx, %rcx
        orq     %rdx, %rax
        movq    %rcx, -24(%rsp)
        movq    %rax, -16(%rsp)
        movdqa  -24(%rsp), %xmm0
        movups  %xmm0, (%rdi)

while we can handle a splat from QImode via

        movzbl  (%rsi), %eax
        movd    %eax, %xmm0
        punpcklbw       %xmm0, %xmm0
        punpcklwd       %xmm0, %xmm0
        pshufd  $0, %xmm0, %xmm0
        movups  %xmm0, (%rdi)
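
For reference, the splat sequence above can be written with SSE2 intrinsics
roughly as follows. This is only a sketch: the function name
splat_byte_sse2 is hypothetical, and the comments name the instruction each
intrinsic usually maps to, which the compiler is free to change.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: broadcast one byte to all 16 lanes using SSE2 only. */
__m128i splat_byte_sse2(const unsigned char *p)
{
    __m128i v = _mm_cvtsi32_si128(*p);  /* movzbl + movd: byte in lane 0 */
    v = _mm_unpacklo_epi8(v, v);        /* punpcklbw: byte fills word 0 */
    v = _mm_unpacklo_epi16(v, v);       /* punpcklwd: word fills dword 0 */
    return _mm_shuffle_epi32(v, 0);     /* pshufd $0: dword 0 to all dwords */
}
```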

I think that for a generic V16QImode CTOR with SSE2 we can create two
V8HImode vectors using pinsrw: the first from the zero-extended QImode
values of the even elements, the second from the zero-extended and
left-shifted values of the odd elements. Then IOR the two vectors.
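
A minimal sketch of that idea with SSE2 intrinsics, to make the lane layout
concrete. The function name build_v16qi_sse2 and the INSERT_PAIR macro are
hypothetical, and this shows the data flow only, not what the expander
would have to emit. It relies on x86 being little-endian, so the even
element lands in the low byte of each 16-bit lane and the odd element in
the high byte.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: build a V16QI vector from 16 arbitrary bytes
   using two V8HI vectors filled with pinsrw, then OR them together. */
__m128i build_v16qi_sse2(const unsigned char e[16])
{
    __m128i even = _mm_setzero_si128();
    __m128i odd  = _mm_setzero_si128();

    /* pinsrw takes an immediate lane index, so the inserts are unrolled.
       Even elements are zero-extended; odd elements are zero-extended
       and shifted left by 8 into the high byte of the lane. */
#define INSERT_PAIR(lane)                                              \
    do {                                                               \
        even = _mm_insert_epi16(even, e[2 * (lane)], (lane));          \
        odd  = _mm_insert_epi16(odd, e[2 * (lane) + 1] << 8, (lane));  \
    } while (0)

    INSERT_PAIR(0);
    INSERT_PAIR(1);
    INSERT_PAIR(2);
    INSERT_PAIR(3);
    INSERT_PAIR(4);
    INSERT_PAIR(5);
    INSERT_PAIR(6);
    INSERT_PAIR(7);
#undef INSERT_PAIR

    return _mm_or_si128(even, odd);  /* por: merge low and high bytes */
}
```

This avoids the stack round-trip entirely, at the cost of sixteen inserts
and one OR.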

Alternatively, the above needs to be pessimized more heavily in the cost
model.

Btw, for HImode elements I see we do

        movzwl  (%rsi), %eax
        movd    %eax, %xmm0
        movdqa  %xmm0, %xmm1
        movdqa  %xmm0, %xmm2
        pinsrw  $1, 4(%rsi), %xmm1
...

not sure why we don't do

        pxor %xmm1, %xmm1
        pinsrw  $0, (%rsi), %xmm1

and thus avoid the round-trip through the GPR for the initial element?
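
In intrinsic form the suggested start of the sequence would look roughly
like this. The function name first_himode_via_pinsrw is hypothetical, and
whether the compiler actually emits pinsrw with a memory operand here,
rather than going through a GPR, is not guaranteed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: put the first HImode element into lane 0 of a
   zeroed vector, as in the pxor + pinsrw $0 sequence suggested above. */
__m128i first_himode_via_pinsrw(const unsigned short *p)
{
    __m128i v = _mm_setzero_si128();    /* pxor */
    return _mm_insert_epi16(v, *p, 0);  /* pinsrw $0; lanes 1..7 stay zero */
}
```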
