[Bug target/109944] vector CTOR with byte elements and SSE2 has STLF fail
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

Richard Biener changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED    |RESOLVED
         Resolution|---         |FIXED

--- Comment #7 from Richard Biener ---
Summary is fixed now.  Any other changes require actual benchmarking I think.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

--- Comment #6 from CVS Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:affee7dcfa1ee272d43ac7cb68cf423dbd956fd8

commit r14-1166-gaffee7dcfa1ee272d43ac7cb68cf423dbd956fd8
Author: Richard Biener
Date:   Wed May 24 10:07:36 2023 +0200

    target/109944 - avoid STLF fail for V16QImode CTOR expansion

    The following dispatches to V2DImode CTOR expansion instead of
    using sets of (subreg:DI (reg:V16QI 146) [08]) which causes LRA
    to spill DImode and reload V16QImode.  The same applies for
    V8QImode or V4HImode construction from SImode parts which happens
    during 32bit libgcc build.

            PR target/109944
            * config/i386/i386-expand.cc
            (ix86_expand_vector_init_general): Perform final vector
            composition using ix86_expand_vector_init_general instead
            of setting the highpart and lowpart which causes spilling.

            * gcc.target/i386/pr109944-1.c: New testcase.
            * gcc.target/i386/pr109944-2.c: Likewise.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

Alexander Monakov changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
                 CC|            |amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov ---
(In reply to Richard Biener from comment #3)
> so we're building SImode elements in %xmm regs and then
> unpack them - that's probably better than a series of
> pinsrw due to dependences.  For uarchs where gpr->xmm
> moves are costly it might be better to do
>
>         pxor    %xmm0, %xmm0
>         pinsrw  $0, (%rsi), %xmm0
>         pinsrw  $1, 32(%rsi), %xmm0
>
> though?

I'm afraid that is impossible: pinsrw with a memory operand will attempt to
load 2 bytes, but only 1 is accessible (if the byte sits at the end of a page).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

Richard Biener changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
   Last reconfirmed|            |2023-05-24
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0           |1
             Status|UNCONFIRMED |ASSIGNED

--- Comment #4 from Richard Biener ---
I'm testing that.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

--- Comment #3 from rguenther at suse dot de ---
On Wed, 24 May 2023, crazylht at gmail dot com wrote:

> --- Comment #2 from Hongtao.liu ---
> > I think we can go and for a generic V16QImode CTOR and SSE2 create two
> > V8HImode vectors using pinsrw, for the first from zero-extended QImode
> > values of the even elements and for the second from zero-extended and
> > left-shifted values of the odd elements and then IOR the two vectors.
>
> Or the backend can recognize it as a HImode(b,c) broadcast + HImode(d,e)
> vec_set (the middle end can recognize it as VEC_DUPLICATE_EXPR +
> .VEC_SET/BIT_INSERT_EXPR when available?)

Yeah.  Note we need to handle the general case with 16 distinct elements
as well.  For simplicity I'm using a memory input below instead of 16
function parameters.

void foo (char * __restrict a, char *b)
{
  a[0] = b[0];
  a[1] = b[16];
  a[2] = b[32];
  a[3] = b[48];
  a[4] = b[64];
  a[5] = b[80];
  a[6] = b[96];
  a[7] = b[112];
  a[8] = b[128];
  a[9] = b[144];
  a[10] = b[160];
  a[11] = b[176];
  a[12] = b[192];
  a[13] = b[208];
  a[14] = b[224];
  a[15] = b[240];
}

with -O2 generates

foo:
.LFB0:
        .cfi_startproc
        movzbl  112(%rsi), %edx
        movzbl  96(%rsi), %eax
        movzbl  224(%rsi), %r8d
        movzbl  (%rsi), %ecx
        salq    $8, %rdx
        orq     %rax, %rdx
        movzbl  80(%rsi), %eax
        salq    $8, %rdx
        orq     %rax, %rdx
        movzbl  64(%rsi), %eax
        ... more of that ...
        orq     %r8, %rax
        movzbl  144(%rsi), %r8d
        movzbl  128(%rsi), %esi
        salq    $8, %rax
        orq     %r8, %rax
        salq    $8, %rax
        orq     %rsi, %rax
        movq    %rax, -16(%rsp)
        movdqa  -24(%rsp), %xmm0
        movups  %xmm0, (%rdi)
        ret

so a way is to form HImode elements in GPRs by shift and or and then
build a V8HImode vector from that.
Note code generation for a V8HImode CTOR also looks imperfect (change
char to short and only do elements 0 to 7 above):

foo:
.LFB0:
        .cfi_startproc
        movzwl  (%rsi), %eax
        movd    %eax, %xmm0
        movzwl  64(%rsi), %eax
        pinsrw  $1, 32(%rsi), %xmm0
        movd    %eax, %xmm3
        movzwl  128(%rsi), %eax
        pinsrw  $1, 96(%rsi), %xmm3
        movd    %eax, %xmm1
        movzwl  192(%rsi), %eax
        punpckldq       %xmm3, %xmm0
        pinsrw  $1, 160(%rsi), %xmm1
        movd    %eax, %xmm2
        pinsrw  $1, 224(%rsi), %xmm2
        punpckldq       %xmm2, %xmm1
        punpcklqdq      %xmm1, %xmm0
        movups  %xmm0, (%rdi)
        ret

so we're building SImode elements in %xmm regs and then unpack them -
that's probably better than a series of pinsrw due to dependences.  For
uarchs where gpr->xmm moves are costly it might be better to do

        pxor    %xmm0, %xmm0
        pinsrw  $0, (%rsi), %xmm0
        pinsrw  $1, 32(%rsi), %xmm0

though?  pinsr* are not especially fast (2 uops on Zen, latency 3,
throughput 1 - on Zen 4 it got worse), so maybe forming SImode in GPRs
might be better for them (or mixing both to better utilize execution
resources).  For the 16 or 8 (distinct) element CTORs there's hardly any
surrounding code we can hope to execute in parallel.

I wonder if we can easily determine in the expander whether we deal with
elements that are nicely available in GPRs or in XMMs, or whether we
need to deal with wrong choices later, for example in STV.  But of
course first we need to avoid the spill/reload.
That's from ix86_expand_vector_init_general doing

  else if (n_words == 2)
    {
      rtx tmp = gen_reg_rtx (mode);
      emit_clobber (tmp);
      emit_move_insn (gen_lowpart (tmp_mode, tmp), words[0]);
      emit_move_insn (gen_highpart (tmp_mode, tmp), words[1]);
      emit_move_insn (target, tmp);

which generates (subreg:DI (reg:V16QI 146) 0) and
(subreg:DI (reg:V16QI 146) 8) and then

(insn 52 51 53 2 (parallel [
            (set (subreg:DI (reg:V16QI 146) 0)
                (ior:DI (reg:DI 122)
                    (reg:DI 121 [ *b_18(D) ])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":3:8 612 {*iordi_1}
     (expr_list:REG_DEAD (reg:DI 122)
        (expr_list:REG_DEAD (reg:DI 121 [ *b_18(D) ])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))
(insn 53 52 55 2 (parallel [
            (set (subreg:DI (reg:V16QI 146) 8)
                (ior:DI (reg:DI 144)
                    (reg:DI 143 [ MEM[(char *)b_18(D) + 128B] ])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":3:8 612 {*iordi_1}
     (expr_list:REG_DEAD (reg:DI 144)
        (expr_list:REG_DEAD (reg:DI 143 [ MEM[(char *)b_18(D) + 128B] ])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))

which makes LRA spill.  Doing it like the n_words == 4 case and
dispatching to ix86_expand_vector_init_general avoids this and code
generates

        movq    %rax, %xmm0
        ...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

--- Comment #2 from Hongtao.liu ---
> I think we can go and for a generic V16QImode CTOR and SSE2 create two
> V8HImode vectors using pinsrw, for the first from zero-extended QImode
> values of the even elements and for the second from zero-extended and
> left-shifted values of the odd elements and then IOR the two vectors.

Or the backend can recognize it as a HImode(b,c) broadcast + HImode(d,e)
vec_set (the middle end can recognize it as VEC_DUPLICATE_EXPR +
.VEC_SET/BIT_INSERT_EXPR when available?)

void foo1 (short* a, char *m)
{
  char d = *m;
  char e = m[2];
  short b = d | e << 8;
  a[0] = b;
  a[1] = b;
  a[2] = b;
  a[3] = b;
  a[4] = b;
  a[5] = b;
  a[6] = b;
  a[7] = b;
}

foo1:
        movsbw  2(%rsi), %ax
        movsbw  (%rsi), %dx
        sall    $8, %eax
        orl     %edx, %eax
        movd    %eax, %xmm0
        punpcklwd       %xmm0, %xmm0
        pshufd  $0, %xmm0, %xmm0
        movups  %xmm0, (%rdi)
        ret
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109944

--- Comment #1 from CVS Commits ---
The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:f504b70eb0fc1339322960041a85606df4547897

commit r14-1137-gf504b70eb0fc1339322960041a85606df4547897
Author: Richard Biener
Date:   Tue May 23 15:12:33 2023 +0200

    Account for vector splat GPR->XMM move cost

    The following also accounts for a GPR->XMM move cost for splat
    operations and properly guards eliding the cost when moving from
    memory only for SSE4.1 or HImode or larger operands.  This doesn't
    fix the PR fully yet.

            PR target/109944
            * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
            For vector construction or splats apply GPR->XMM move
            costing.  QImode memory can be handled directly only with
            SSE4.1 pinsrb.