[Bug target/94908] Failure to optimally optimize certain shuffle patterns

2023-04-18 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908

--- Comment #12 from Uroš Bizjak  ---
Now also implemented for x86.

[Bug target/94908] Failure to optimally optimize certain shuffle patterns

2023-04-18 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908

--- Comment #11 from CVS Commits  ---
The master branch has been updated by Uros Bizjak :

https://gcc.gnu.org/g:95b99e47f4f2df2d0c5680f45e3ec0a3170218ad

commit r14-47-g95b99e47f4f2df2d0c5680f45e3ec0a3170218ad
Author: Uros Bizjak 
Date:   Tue Apr 18 17:50:37 2023 +0200

i386: Improve permutations with INSERTPS instruction [PR94908]

INSERTPS can select any element from the src and insert it into any place
of the dest.  For SSE4.1 targets, the compiler can generate e.g.

insertps $64, %xmm0, %xmm1

to insert element 1 from %xmm1 to element 0 of %xmm0.

gcc/ChangeLog:

PR target/94908
* config/i386/i386-builtin.def (__builtin_ia32_insertps128):
Use CODE_FOR_sse4_1_insertps_v4sf.
* config/i386/i386-expand.cc (expand_vec_perm_insertps): New.
(expand_vec_perm_1): Call expand_vec_perm_insertps.
* config/i386/i386.md ("unspec"): Declare UNSPEC_INSERTPS here.
* config/i386/mmx.md (mmxscalarmode): New mode attribute.
(@sse4_1_insertps_): New insn pattern.
* config/i386/sse.md (@sse4_1_insertps_): Macroize insn
pattern from sse4_1_insertps using VI4F_128 mode iterator.

gcc/testsuite/ChangeLog:

PR target/94908
* gcc.target/i386/pr94908.c: New test.
* gcc.target/i386/sse4_1-insertps-5.c: New test.
* gcc.target/i386/vperm-v4sf-2-sse4.c: New test.

[Bug target/94908] Failure to optimally optimize certain shuffle patterns

2023-03-09 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908

Uroš Bizjak  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #54607|0                           |1
        is obsolete|                            |

--- Comment #10 from Uroš Bizjak  ---
Created attachment 54624
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54624&action=edit
Proposed patch v2

New version with some code shamelessly stolen from aarch64.

[Bug target/94908] Failure to optimally optimize certain shuffle patterns

2023-03-09 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908

--- Comment #9 from Uroš Bizjak  ---
(In reply to Hongtao.liu from comment #8)

> I'm thinking of something like below so that it can be matched both by
> expand_vselect_vconcat in ix86_expand_vec_perm_const_1 and by patterns
> created by pass_combine (theoretically).
> 
> +(define_insn_and_split "*sse4_1_insertps_1"
> +  [(set (match_operand:VI4F_128 0 "register_operand")
> +   (vec_select:VI4F_128
> + (vec_concat:
> +   (match_operand:VI4F_128 1 "register_operand")
> +   (match_operand:VI4F_128 2 "register_operand"))
> + (match_parallel 3 "insertps_parallel"
> +   [(match_operand 4 "const_int_operand")])))]
> +  "TARGET_SSE4_1 && ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"

If you want to go that way, then the resulting pattern should look like a
combination of:

(define_insn "*vec_setv4sf_sse4_1"
  [(set (match_operand:V4SF 0 "register_operand" "=Yr,*x,v")
(vec_merge:V4SF
  (vec_duplicate:V4SF
(match_operand:SF 2 "nonimmediate_operand" "Yrm,*xm,vm"))
  (match_operand:V4SF 1 "register_operand" "0,0,v")
  (match_operand:SI 3 "const_0_to_3_operand")))]
  "TARGET_SSE4_1
   && ((unsigned) exact_log2 (INTVAL (operands[3]))
   < GET_MODE_NUNITS (V4SFmode))"

(define_insn_and_split "*sse4_1_extractps"
  [(set (match_operand:SF 0 "nonimmediate_operand" "=rm,rm,rm,Yv,Yv")
(vec_select:SF
  (match_operand:V4SF 1 "register_operand" "Yr,*x,v,0,v")
  (parallel [(match_operand:SI 2 "const_0_to_3_operand")])))]
  "TARGET_SSE4_1"

where the latter pattern propagates into the former in place of operand 2.  This
combination is created only for a scalar insert of an extracted value, so I doubt
it is ever created...

[Bug target/94908] Failure to optimally optimize certain shuffle patterns

2023-03-08 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908

--- Comment #8 from Hongtao.liu  ---
(In reply to Uroš Bizjak from comment #7)
> Created attachment 54607 [details]
> Proposed patch
> 
> Patch in testing.
> 
> Attached patch produces (-O2 -msse4.1):
> 
> f:
>         subq    $24, %rsp
>         xorl    %eax, %eax
>         vmovaps %xmm0, (%rsp)
>         call    g
>         vmovaps (%rsp), %xmm1
>         addq    $24, %rsp
>         vinsertps       $64, %xmm0, %xmm1, %xmm0
>         ret

I'm thinking of something like below so that it can be matched both by
expand_vselect_vconcat in ix86_expand_vec_perm_const_1 and by patterns created
by pass_combine (theoretically).

+(define_insn_and_split "*sse4_1_insertps_1"
+  [(set (match_operand:VI4F_128 0 "register_operand")
+   (vec_select:VI4F_128
+ (vec_concat:
+   (match_operand:VI4F_128 1 "register_operand")
+   (match_operand:VI4F_128 2 "register_operand"))
+ (match_parallel 3 "insertps_parallel"
+   [(match_operand 4 "const_int_operand")])))]
+  "TARGET_SSE4_1 && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"

[Bug target/94908] Failure to optimally optimize certain shuffle patterns

2023-03-08 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908

--- Comment #7 from Uroš Bizjak  ---
Created attachment 54607
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54607&action=edit
Proposed patch

Patch in testing.

Attached patch produces (-O2 -msse4.1):

f:
        subq    $24, %rsp
        xorl    %eax, %eax
        vmovaps %xmm0, (%rsp)
        call    g
        vmovaps (%rsp), %xmm1
        addq    $24, %rsp
        vinsertps       $64, %xmm0, %xmm1, %xmm0
        ret

[Bug target/94908] Failure to optimally optimize certain shuffle patterns

2023-02-20 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908

--- Comment #6 from Hongtao.liu  ---
Yes, INSERTPS can select any element from the src and insert it into any place
of the dest.  Under SSE4.1, x86 can generate
  vinsertps   xmm0, xmm1, xmm0, 64  # xmm0 = xmm0[1],xmm1[1,2,3]

[Bug target/94908] Failure to optimally optimize certain shuffle patterns

2023-02-18 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908

Uroš Bizjak  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #5 from Uroš Bizjak  ---
(In reply to Andrew Pinski from comment #4)
> Both the x86_64 and the PowerPC PERM implementations could be improved to
> support the insertion like the aarch64 backend does.

Cc Hongtao for x86 part.

[Bug target/94908] Failure to optimally optimize certain shuffle patterns

2023-02-17 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53346,
                   |                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93720
          Component|tree-optimization           |target

--- Comment #4 from Andrew Pinski  ---
I think this was a target issue and maybe should be split into a couple of
different bugs.

For GCC 8, aarch64 produces:
dup v0.4s, v0.s[1]
ldr q1, [sp, 16]
ldp x29, x30, [sp], 32
ins v0.s[1], v1.s[1]
ins v0.s[2], v1.s[2]
ins v0.s[3], v1.s[3]


GCC 9/10 produced the following (which is OK, though it could be improved, as
it was in GCC 11):
adrp    x0, .LC0
ldr q1, [sp, 16]
ldr q2, [x0, #:lo12:.LC0]
ldp x29, x30, [sp], 32
tbl v0.16b, {v0.16b - v1.16b}, v2.16b

For GCC 11+, aarch64 produces:
ldr q1, [sp, 16]
ins v1.s[0], v0.s[1]
mov v0.16b, v1.16b


Which means that for aarch64 this was changed in GCC 10 and fully fixed in GCC
11 (by r11-2192-gc9c87e6f9c795b, aka PR 93720, which was in fact my patch).

For x86_64, the trunk produces:

movaps  (%rsp), %xmm1
addq    $24, %rsp
shufps  $85, %xmm1, %xmm0
shufps  $232, %xmm1, %xmm0

While GCC 12 produces:

movaps  (%rsp), %xmm1
addq    $24, %rsp
shufps  $85, %xmm0, %xmm0
movaps  %xmm1, %xmm2
shufps  $85, %xmm1, %xmm2
movaps  %xmm2, %xmm3
movaps  %xmm1, %xmm2
unpckhps%xmm1, %xmm2
unpcklps%xmm3, %xmm0
shufps  $255, %xmm1, %xmm1
unpcklps%xmm1, %xmm2
movlhps %xmm2, %xmm0

This was changed with r13-2843-g3db8e9c2422d92 (aka PR 53346).

For powerpc64le, it looks ok for GCC 11:
addis 9,2,.LC0@toc@ha
addi 1,1,48
addi 9,9,.LC0@toc@l
li 0,-16
lvx 0,0,9
vperm 2,31,2,0

Both the x86_64 and the PowerPC PERM implementations could be improved to
support the insertion like the aarch64 backend does.