[Bug target/109944] vector CTOR with byte elements and SSE2 has STLF fail

2023-05-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Richard Biener  ---
Summary is fixed now.  Any other changes require actual benchmarking I think.

[Bug target/109944] vector CTOR with byte elements and SSE2 has STLF fail

2023-05-24 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

--- Comment #6 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:affee7dcfa1ee272d43ac7cb68cf423dbd956fd8

commit r14-1166-gaffee7dcfa1ee272d43ac7cb68cf423dbd956fd8
Author: Richard Biener 
Date:   Wed May 24 10:07:36 2023 +0200

target/109944 - avoid STLF fail for V16QImode CTOR expansion

The following dispatches to V2DImode CTOR expansion instead of
using sets of (subreg:DI (reg:V16QI 146) 0) and (subreg:DI
(reg:V16QI 146) 8), which cause
LRA to spill DImode and reload V16QImode.  The same applies for
V8QImode or V4HImode construction from SImode parts which happens
during 32bit libgcc build.

PR target/109944
* config/i386/i386-expand.cc (ix86_expand_vector_init_general):
Perform final vector composition using
ix86_expand_vector_init_general instead of setting
the highpart and lowpart which causes spilling.

* gcc.target/i386/pr109944-1.c: New testcase.
* gcc.target/i386/pr109944-2.c: Likewise.
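The composition the commit switches to can be modeled in portable C (a sketch with a hypothetical helper name, not GCC code): the 16-byte vector is built from two 64-bit words formed in scalar registers, treated as the two elements of a V2DI-style constructor, instead of storing into the DImode low/high subregs of the vector register (the pattern that made LRA spill).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* build_v16qi is a hypothetical name.  The two 64-bit halves are
   composed as V2DI elements; no write to a subreg of the vector
   register is needed, so nothing forces a stack spill. */
static void build_v16qi(uint8_t out[16], uint64_t lo, uint64_t hi)
{
    memcpy(out, &lo, 8);       /* low V2DI element  */
    memcpy(out + 8, &hi, 8);   /* high V2DI element */
}
```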

[Bug target/109944] vector CTOR with byte elements and SSE2 has STLF fail

2023-05-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
(In reply to Richard Biener from comment #3)
> so we're building SImode elements in %xmm regs and then
> unpack them - that's probably better than a series of
> pinsrw due to dependences.  For uarchs where gpr->xmm
> moves are costly it might be better to do
> 
>   pxor %xmm0, %xmm0
>   pinsrw $0, (%rsi), %xmm0
>   pinsrw $1, 32(%rsi), %xmm0
> 
> though?

I'm afraid that is impossible: pinsrw will attempt to load 2 bytes, but only 1
may be accessible (if the element is at the end of a page).

[Bug target/109944] vector CTOR with byte elements and SSE2 has STLF fail

2023-05-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

Richard Biener  changed:

   What|Removed |Added

   Last reconfirmed||2023-05-24
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED

--- Comment #4 from Richard Biener  ---
I'm testing that.

[Bug target/109944] vector CTOR with byte elements and SSE2 has STLF fail

2023-05-24 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

--- Comment #3 from rguenther at suse dot de  ---
On Wed, 24 May 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944
> 
> --- Comment #2 from Hongtao.liu  ---
> > I think we can go and for a generic V16QImode CTOR and SSE2 create two
> > V8HImode vectors using pinsrw, for the first from zero-extended QImode
> > values of the even elements and for the second from zero-extended and
> > left-shifted values of the odd elements and then IOR the two vectors.
> 
> Or the backend can recognize it as a HImode(b,c) broadcast + HImode(d,e)
> vec_set (the middle end can recognize it as VEC_DUPLICATE_EXPR +
> .VEC_SET/BIT_INSERT_EXPR when available?)

Yeah.  Note we need to handle the general case with 16 distinct
elements as well.  For simplicity I'm using a memory input below
instead of 16 function parameters.

void foo (char * __restrict a, char *b)
{
  a[0] = b[0];
  a[1] = b[16];
  a[2] = b[32];
  a[3] = b[48];
  a[4] = b[64];
  a[5] = b[80];
  a[6] = b[96];
  a[7] = b[112];
  a[8] = b[128];
  a[9] = b[144];
  a[10] = b[160];
  a[11] = b[176];
  a[12] = b[192];
  a[13] = b[208];
  a[14] = b[224];
  a[15] = b[240];
}

with -O2 generates

foo:
.LFB0:
.cfi_startproc
movzbl  112(%rsi), %edx
movzbl  96(%rsi), %eax
movzbl  224(%rsi), %r8d
movzbl  (%rsi), %ecx
salq    $8, %rdx
orq     %rax, %rdx
movzbl  80(%rsi), %eax
salq    $8, %rdx
orq     %rax, %rdx
movzbl  64(%rsi), %eax
... more of that ...
orq     %r8, %rax
movzbl  144(%rsi), %r8d
movzbl  128(%rsi), %esi
salq    $8, %rax
orq     %r8, %rax
salq    $8, %rax
orq     %rsi, %rax
movq    %rax, -16(%rsp)
movdqa  -24(%rsp), %xmm0
movups  %xmm0, (%rdi)
ret

so a way is to form HImode elements in GPRs by shift and or
and then build a V8HImode vector from that.  Note
code generation for a V8HImode CTOR also looks imperfect
(change char to short and only do elements 0 to 7 above):

foo:
.LFB0:
.cfi_startproc
movzwl  (%rsi), %eax
movd    %eax, %xmm0
movzwl  64(%rsi), %eax
pinsrw  $1, 32(%rsi), %xmm0
movd    %eax, %xmm3
movzwl  128(%rsi), %eax
pinsrw  $1, 96(%rsi), %xmm3
movd    %eax, %xmm1
movzwl  192(%rsi), %eax
punpckldq   %xmm3, %xmm0
pinsrw  $1, 160(%rsi), %xmm1
movd    %eax, %xmm2
pinsrw  $1, 224(%rsi), %xmm2
punpckldq   %xmm2, %xmm1
punpcklqdq  %xmm1, %xmm0
movups  %xmm0, (%rdi)
ret

so we're building SImode elements in %xmm regs and then
unpack them - that's probably better than a series of
pinsrw due to dependences.  For uarchs where gpr->xmm
moves are costly it might be better to do

  pxor %xmm0, %xmm0
  pinsrw $0, (%rsi), %xmm0
  pinsrw $1, 32(%rsi), %xmm0

though?  pinsr* are not especially fast (2 uops on zen,
latency 3, throughput 1 - on zen4 it got worse), so maybe
forming SImode in GPRs might be better for them (or mixing
both to better utilize execution resources).  For the 16
or 8 (distinct) element CTORs there's hardly surrounding
code we can hope to execute in parallel.  I wonder if we
can easily determine in the expander whether we deal with
elements that are nicely available in GPRs or in XMMs
or whether we need to deal with wrong choices later,
for example in STV.
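The shift-and-or strategy sketched above can be modeled in portable scalar C (my own sketch under a hypothetical name, not GCC code): each HImode element is composed in a GPR as lo | hi << 8, and the eight 16-bit words then make up the V8HImode vector (a plain array stands in for %xmm0; on real hardware the final gather would be movd/punpck or pinsrw).

```c
#include <assert.h>
#include <stdint.h>

/* build_v8hi_from_bytes is a hypothetical name.  Each 16-bit
   element is formed in a "GPR" by shift and or from a pair of
   byte elements. */
static void build_v8hi_from_bytes(uint16_t out[8], const uint8_t b[16])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(b[2 * i] | (b[2 * i + 1] << 8));
}
```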

But of course first we need to avoid spill/reload.
That's from ix86_expand_vector_init_general doing

  else if (n_words == 2)
{
  rtx tmp = gen_reg_rtx (mode);
  emit_clobber (tmp);
  emit_move_insn (gen_lowpart (tmp_mode, tmp), words[0]);
  emit_move_insn (gen_highpart (tmp_mode, tmp), words[1]);
  emit_move_insn (target, tmp);

which generates (subreg:DI (reg:V16QI 146) 0) and
(subreg:DI (reg:V16QI 146) 8) and then

(insn 52 51 53 2 (parallel [
(set (subreg:DI (reg:V16QI 146) 0)
(ior:DI (reg:DI 122) 
(reg:DI 121 [ *b_18(D) ])))
(clobber (reg:CC 17 flags))
]) "t.c":3:8 612 {*iordi_1}
 (expr_list:REG_DEAD (reg:DI 122)
(expr_list:REG_DEAD (reg:DI 121 [ *b_18(D) ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil))))
(insn 53 52 55 2 (parallel [
(set (subreg:DI (reg:V16QI 146) 8)
(ior:DI (reg:DI 144)
(reg:DI 143 [ MEM[(char *)b_18(D) + 128B] ])))
(clobber (reg:CC 17 flags))
]) "t.c":3:8 612 {*iordi_1} 
 (expr_list:REG_DEAD (reg:DI 144)
(expr_list:REG_DEAD (reg:DI 143 [ MEM[(char *)b_18(D) + 128B] ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil))))

which makes LRA spill.  Doing like n_words == 4 and dispatching to
ix86_expand_vector_init_general avoids this and code generates

movq    %rax, %xmm0
...
   

[Bug target/109944] vector CTOR with byte elements and SSE2 has STLF fail

2023-05-24 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

--- Comment #2 from Hongtao.liu  ---
> I think we can go and for a generic V16QImode CTOR and SSE2 create two
> V8HImode vectors using pinsrw, for the first from zero-extended QImode
> values of the even elements and for the second from zero-extended and
> left-shifted values of the odd elements and then IOR the two vectors.

Or the backend can recognize it as a HImode(b,c) broadcast + HImode(d,e)
vec_set (the middle end can recognize it as VEC_DUPLICATE_EXPR +
.VEC_SET/BIT_INSERT_EXPR when available?)


void foo1(short* a, char *m)
{
  char d = *m;
  char e = m[2];
  short b = d | e << 8;
  a[0] = b;
  a[1] = b;
  a[2] = b;
  a[3] = b;
  a[4] = b;
  a[5] = b;
  a[6] = b;
  a[7] = b;
}

foo1:
movsbw  2(%rsi), %ax
movsbw  (%rsi), %dx
sall    $8, %eax
orl     %edx, %eax
movd    %eax, %xmm0
punpcklwd   %xmm0, %xmm0
pshufd  $0, %xmm0, %xmm0
movups  %xmm0, (%rdi)
ret
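The even/odd scheme quoted at the top of this comment can be modeled the same way (a sketch under a hypothetical name, not GCC code): one V8HImode vector is built from the zero-extended even bytes, a second from the zero-extended, left-shifted odd bytes, and a single IOR combines them so that each 16-bit lane holds (even | odd << 8).

```c
#include <assert.h>
#include <stdint.h>

/* ctor_even_odd_ior is a hypothetical name.  even[] models the
   vector built by pinsrw from zero-extended even elements, odd[]
   the one built from zero-extended, left-shifted odd elements;
   the final loop models the single por combining them. */
static void ctor_even_odd_ior(uint16_t out[8], const uint8_t b[16])
{
    uint16_t even[8], odd[8];
    for (int i = 0; i < 8; i++) {
        even[i] = b[2 * i];
        odd[i]  = (uint16_t)(b[2 * i + 1] << 8);
    }
    for (int i = 0; i < 8; i++)
        out[i] = even[i] | odd[i];
}
```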

[Bug target/109944] vector CTOR with byte elements and SSE2 has STLF fail

2023-05-23 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

--- Comment #1 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:f504b70eb0fc1339322960041a85606df4547897

commit r14-1137-gf504b70eb0fc1339322960041a85606df4547897
Author: Richard Biener 
Date:   Tue May 23 15:12:33 2023 +0200

Account for vector splat GPR->XMM move cost

The following also accounts for a GPR->XMM move cost for splat
operations and properly guards eliding the cost when moving from
memory only for SSE4.1 or HImode or larger operands.  This
doesn't fix the PR fully yet.

PR target/109944
* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
For vector construction or splats apply GPR->XMM move
costing.  QImode memory can be handled directly only
with SSE4.1 pinsrb.