[Bug target/94908] Failure to optimally optimize certain shuffle patterns
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908

--- Comment #12 from Uroš Bizjak ---
Implemented also for x86.
--- Comment #11 from CVS Commits ---
The master branch has been updated by Uros Bizjak:

https://gcc.gnu.org/g:95b99e47f4f2df2d0c5680f45e3ec0a3170218ad

commit r14-47-g95b99e47f4f2df2d0c5680f45e3ec0a3170218ad
Author: Uros Bizjak
Date:   Tue Apr 18 17:50:37 2023 +0200

    i386: Improve permutations with INSERTPS instruction [PR94908]

    INSERTPS can select any element from src and insert it into any place
    of the dest.  For SSE4.1 targets, the compiler can generate e.g.

        insertps $64, %xmm0, %xmm1

    to insert element 1 from %xmm1 into element 0 of %xmm0.

    gcc/ChangeLog:

            PR target/94908
            * config/i386/i386-builtin.def (__builtin_ia32_insertps128):
            Use CODE_FOR_sse4_1_insertps_v4sf.
            * config/i386/i386-expand.cc (expand_vec_perm_insertps): New.
            (expand_vec_perm_1): Call expand_vec_perm_insertps.
            * config/i386/i386.md ("unspec"): Declare UNSPEC_INSERTPS here.
            * config/i386/mmx.md (mmxscalarmode): New mode attribute.
            (@sse4_1_insertps_<mode>): New insn pattern.
            * config/i386/sse.md (@sse4_1_insertps_<mode>): Macroize insn
            pattern from sse4_1_insertps using VI4F_128 mode iterator.

    gcc/testsuite/ChangeLog:

            PR target/94908
            * gcc.target/i386/pr94908.c: New test.
            * gcc.target/i386/sse4_1-insertps-5.c: New test.
            * gcc.target/i386/vperm-v4sf-2-sse4.c: New test.
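As a hedged illustration (the new testcases' contents are not shown in this thread), a GNU C shuffle of the shape the commit targets might look like the sketch below: element 1 of one vector inserted into element 0 of another, which an SSE4.1 target can now emit as a single insertps. Function and variable names are mine, not from the PR.

```c
/* Sketch of a shuffle that a single SSE4.1 INSERTPS can implement.
   Compile with e.g. gcc -O2 -msse4.1.  */
typedef float v4sf __attribute__ ((vector_size (16)));

/* result = { b[1], a[1], a[2], a[3] }, i.e. insertps $64.  */
v4sf
merge (v4sf a, v4sf b)
{
  a[0] = b[1];
  return a;
}
```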
Uroš Bizjak changed:

           What              |Removed |Added
  Attachment #54607 is       |0       |1
  obsolete                   |        |

--- Comment #10 from Uroš Bizjak ---
Created attachment 54624
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54624&action=edit
Proposed patch v2

New version with some code shamelessly stolen from aarch64.
--- Comment #9 from Uroš Bizjak ---
(In reply to Hongtao.liu from comment #8)
> I'm thinking of something like below so it can be matched both by
> expand_vselect_vconcat in ix86_expand_vec_perm_const_1 and by patterns
> created by pass_combine (theoretically).
>
> +(define_insn_and_split "*sse4_1_insertps_1"
> +  [(set (match_operand:VI4F_128 0 "register_operand")
> +        (vec_select:VI4F_128
> +          (vec_concat:
> +            (match_operand:VI4F_128 1 "register_operand")
> +            (match_operand:VI4F_128 2 "register_operand"))
> +          (match_parallel 3 "insertps_parallel"
> +            [(match_operand 4 "const_int_operand")])))]
> +  "TARGET_SSE4_1 && ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"

If you want to go that way, then the resulting pattern should look like a
combination of:

(define_insn "*vec_setv4sf_sse4_1"
  [(set (match_operand:V4SF 0 "register_operand" "=Yr,*x,v")
        (vec_merge:V4SF
          (vec_duplicate:V4SF
            (match_operand:SF 2 "nonimmediate_operand" "Yrm,*xm,vm"))
          (match_operand:V4SF 1 "register_operand" "0,0,v")
          (match_operand:SI 3 "const_0_to_3_operand")))]
  "TARGET_SSE4_1
   && ((unsigned) exact_log2 (INTVAL (operands[3]))
       < GET_MODE_NUNITS (V4SFmode))"

and:

(define_insn_and_split "*sse4_1_extractps"
  [(set (match_operand:SF 0 "nonimmediate_operand" "=rm,rm,rm,Yv,Yv")
        (vec_select:SF
          (match_operand:V4SF 1 "register_operand" "Yr,*x,v,0,v")
          (parallel [(match_operand:SI 2 "const_0_to_3_operand")])))]
  "TARGET_SSE4_1"

where the latter pattern propagates into the former in place of operand 2.
This combination is created only for a scalar insert of an extracted value,
so I doubt it is ever created...
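For illustration, the "scalar insert of an extracted value" shape that would let those two patterns combine could come from GNU C source like the following sketch (function name hypothetical, not from the PR): one element is extracted to a scalar (the vec_select/extractps side) and then stored into a lane of another vector (the vec_merge/vec_set side).

```c
typedef float v4sf __attribute__ ((vector_size (16)));

/* Scalar insert of an extracted value: element 2 of b is extracted
   to a scalar, then inserted into element 0 of a.  */
v4sf
set_from_extract (v4sf a, v4sf b)
{
  float t = b[2];   /* vec_select (extractps)  */
  a[0] = t;         /* vec_merge  (vec_set)    */
  return a;
}
```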
--- Comment #8 from Hongtao.liu ---
(In reply to Uroš Bizjak from comment #7)
> Created attachment 54607 [details]
> Proposed patch
>
> Patch in testing.
>
> Attached patch produces (-O2 -msse4.1):
>
> f:
>         subq    $24, %rsp
>         xorl    %eax, %eax
>         vmovaps %xmm0, (%rsp)
>         call    g
>         vmovaps (%rsp), %xmm1
>         addq    $24, %rsp
>         vinsertps       $64, %xmm0, %xmm1, %xmm0
>         ret

I'm thinking of something like below so it can be matched both by
expand_vselect_vconcat in ix86_expand_vec_perm_const_1 and by patterns
created by pass_combine (theoretically).

+(define_insn_and_split "*sse4_1_insertps_1"
+  [(set (match_operand:VI4F_128 0 "register_operand")
+        (vec_select:VI4F_128
+          (vec_concat:
+            (match_operand:VI4F_128 1 "register_operand")
+            (match_operand:VI4F_128 2 "register_operand"))
+          (match_parallel 3 "insertps_parallel"
+            [(match_operand 4 "const_int_operand")])))]
+  "TARGET_SSE4_1 && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
--- Comment #7 from Uroš Bizjak ---
Created attachment 54607
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54607&action=edit
Proposed patch

Patch in testing.

Attached patch produces (-O2 -msse4.1):

f:
        subq    $24, %rsp
        xorl    %eax, %eax
        vmovaps %xmm0, (%rsp)
        call    g
        vmovaps (%rsp), %xmm1
        addq    $24, %rsp
        vinsertps       $64, %xmm0, %xmm1, %xmm0
        ret
--- Comment #6 from Hongtao.liu ---
Yes, INSERTPS can select any element from src and insert it into any place
of the dest.  Under SSE4.1, x86 can generate

        vinsertps       xmm0, xmm1, xmm0, 64    # xmm0 = xmm0[1],xmm1[1,2,3]
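To make the immediate concrete: per the Intel SDM, bits 7:6 of the INSERTPS imm8 select the source element (for the register-source form), bits 5:4 select the destination slot, and bits 3:0 are a zero mask applied to the result. A scalar C model of that decoding (my sketch for explanation, not GCC-internal code):

```c
#include <stdint.h>

/* Scalar model of SSE4.1 INSERTPS (register-source form), showing
   how the imm8 is decoded.  Follows the Intel SDM description.  */
static void
insertps_model (float dst[4], const float src[4], uint8_t imm)
{
  unsigned count_s = (imm >> 6) & 3;   /* which src element to take   */
  unsigned count_d = (imm >> 4) & 3;   /* where to put it in dst      */
  unsigned zmask   = imm & 0xf;        /* result elements to zero out */

  dst[count_d] = src[count_s];
  for (unsigned i = 0; i < 4; i++)
    if (zmask & (1u << i))
      dst[i] = 0.0f;
}
```

With imm = 64 (0x40): count_s = 1, count_d = 0, zmask = 0, so dst becomes { src[1], dst[1], dst[2], dst[3] }, matching the "xmm0 = xmm0[1],xmm1[1,2,3]" annotation above.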
Uroš Bizjak changed:

           What|Removed |Added
             CC|        |crazylht at gmail dot com

--- Comment #5 from Uroš Bizjak ---
(In reply to Andrew Pinski from comment #4)
> Both the x86_64 and the PowerPC PERM implementations could be improved to
> support the insertion like the aarch64 backend does too.

Cc Hongtao for the x86 part.
Andrew Pinski changed:

           What|Removed           |Added
       Severity|normal            |enhancement
       See Also|                  |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53346,
               |                  |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93720
      Component|tree-optimization |target

--- Comment #4 from Andrew Pinski ---
I think this was a target issue and maybe should be split into a couple of
different bugs.

For GCC 8, aarch64 produces:

        dup     v0.4s, v0.s[1]
        ldr     q1, [sp, 16]
        ldp     x29, x30, [sp], 32
        ins     v0.s[1], v1.s[1]
        ins     v0.s[2], v1.s[2]
        ins     v0.s[3], v1.s[3]

GCC 9/10 produced the following (which is OK, though it could be improved,
as it was in GCC 11):

        adrp    x0, .LC0
        ldr     q1, [sp, 16]
        ldr     q2, [x0, #:lo12:.LC0]
        ldp     x29, x30, [sp], 32
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b

For GCC 11+, aarch64 produces:

        ldr     q1, [sp, 16]
        ins     v1.s[0], v0.s[1]
        mov     v0.16b, v1.16b

Which means for aarch64 this was changed in GCC 10 and fixed fully in GCC 11
(by r11-2192-gc9c87e6f9c795b, aka PR 93720, which was my patch in fact).

For x86_64, the trunk produces:

        movaps  (%rsp), %xmm1
        addq    $24, %rsp
        shufps  $85, %xmm1, %xmm0
        shufps  $232, %xmm1, %xmm0

While GCC 12 produces:

        movaps  (%rsp), %xmm1
        addq    $24, %rsp
        shufps  $85, %xmm0, %xmm0
        movaps  %xmm1, %xmm2
        shufps  $85, %xmm1, %xmm2
        movaps  %xmm2, %xmm3
        movaps  %xmm1, %xmm2
        unpckhps        %xmm1, %xmm2
        unpcklps        %xmm3, %xmm0
        shufps  $255, %xmm1, %xmm1
        unpcklps        %xmm1, %xmm2
        movlhps %xmm2, %xmm0

This was changed with r13-2843-g3db8e9c2422d92 (aka PR 53346).

For powerpc64le, it looks OK for GCC 11:

        addis   9,2,.LC0@toc@ha
        addi    1,1,48
        addi    9,9,.LC0@toc@l
        li      0,-16
        lvx     0,0,9
        vperm   2,31,2,0

Both the x86_64 and the PowerPC PERM implementations could be improved to
support the insertion like the aarch64 backend does too.
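The permutation the backends are being asked to recognize is a two-operand shuffle that can be written generically in GNU C with __builtin_shuffle, where mask indices 0-3 select from the first operand and 4-7 from the second. The mask below is illustrative, chosen to match the insert-one-element shape discussed in this thread, not copied from the PR's testcase.

```c
typedef float v4sf __attribute__ ((vector_size (16)));
typedef int   v4si __attribute__ ((vector_size (16)));

/* Generic GNU C spelling of an "insert one element" permutation:
   mask { 5, 1, 2, 3 } yields { b[1], a[1], a[2], a[3] }, which a
   target can emit as a single INSERTPS (x86) or INS (aarch64) once
   its vec_perm expander recognizes the shape.  */
v4sf
perm (v4sf a, v4sf b)
{
  return __builtin_shuffle (a, b, (v4si) { 5, 1, 2, 3 });
}
```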