[Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3

2021-09-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

Richard Biener  changed:

           What    |Removed |Added
 ----------------------------------------------------------------
         Resolution|---     |FIXED
   Target Milestone|---     |12.0
             Status|ASSIGNED|RESOLVED

--- Comment #8 from Richard Biener  ---
Fixed for GCC 12.

[Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3

2021-09-27 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

--- Comment #7 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:6390c5047adb75960f86d56582e6322aaa4d9281

commit r12-3893-g6390c5047adb75960f86d56582e6322aaa4d9281
Author: Richard Biener 
Date:   Wed Nov 18 09:36:57 2020 +0100

Allow different vector types for stmt groups

This allows vectorization (in practice non-loop vectorization) to
have a stmt participate in vectorizations using different vector
types.  It allows us to remove vect_update_shared_vectype and replace
it by pushing/popping STMT_VINFO_VECTYPE from SLP_TREE_VECTYPE around
vect_analyze_stmt and vect_transform_stmt.
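
As a rough illustration of that push/pop idea, here is a hedged sketch
with stand-in types - not the actual tree-vect-stmts.c code; only the
names it mimics (STMT_VINFO_VECTYPE, SLP_TREE_VECTYPE,
vect_analyze_stmt) come from the commit message above:

  /* Stand-ins for GCC's stmt_vec_info / slp_tree; sketch only.  */
  struct vec_type;                      /* a vector type */

  struct stmt_info {
    vec_type *vectype;                  /* mimics STMT_VINFO_VECTYPE */
  };

  struct slp_node {
    vec_type *vectype;                  /* mimics SLP_TREE_VECTYPE */
    stmt_info *representative;
  };

  static bool analyze_stmt (stmt_info *) { return true; }  /* placeholder */

  /* Impose the SLP node's vector type on its representative stmt for
     the duration of the analysis, then restore the previous value.  */
  static bool
  analyze_node_stmt (slp_node *node)
  {
    stmt_info *si = node->representative;
    vec_type *saved = si->vectype;      /* "push" */
    si->vectype = node->vectype;
    bool ok = analyze_stmt (si);
    si->vectype = saved;                /* "pop" */
    return ok;
  }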

For data-refs the situation is a bit more complicated since we
analyze alignment info with a specific vector type in mind, which
doesn't play well when that type changes.

So the bulk of the change is passing down the actual vector type
used for a vectorized access to the various accessors of alignment
info, first and foremost dr_misalignment but also aligned_access_p,
known_alignment_for_access_p, vect_known_alignment_in_bytes and
vect_supportable_dr_alignment.  I took the liberty of replacing the
ALL_CAPS macro accessors with lower-case function invocations.

The actual behavioral changes are in dr_misalignment, which is now
the place that factors in the negative step adjustment as well as
handling alignment queries for a vector type with bigger alignment
requirements than what we can (or have) analyzed.

vect_slp_analyze_node_alignment makes use of this and, upon receiving
a vector type with a bigger alignment requirement, re-analyzes the DR
with respect to it but keeps an older, more precise result if possible.
In this context it might be possible to do the analysis just once:
instead of analyzing with respect to a specific desired alignment,
look for the biggest alignment for which we can still compute a known
misalignment.
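
A minimal sketch of that "keep the better result" policy, again with
stand-in types (hypothetical; the real logic lives in dr_misalignment
and vect_slp_analyze_node_alignment, and -1 follows the vectorizer's
convention for an unknown misalignment):

  const int DR_MISALIGNMENT_UNKNOWN = -1;

  struct dr_align_info {
    int target_alignment;     /* alignment the DR was analyzed against */
    int misalignment;         /* known misalignment, or UNKNOWN */
  };

  /* Placeholder for the real base/offset alignment computation.  */
  static int
  compute_misalignment (const dr_align_info &)
  {
    return DR_MISALIGNMENT_UNKNOWN;
  }

  /* Re-analyze when a vector type asks for a bigger alignment than the
     DR was analyzed against, but keep the older, more precise result
     if the fresh analysis only yields "unknown".  */
  static void
  reanalyze_alignment (dr_align_info &dr, int desired_alignment)
  {
    if (desired_alignment <= dr.target_alignment)
      return;                 /* previous analysis already covers this */

    dr_align_info fresh = { desired_alignment, DR_MISALIGNMENT_UNKNOWN };
    fresh.misalignment = compute_misalignment (fresh);

    if (fresh.misalignment != DR_MISALIGNMENT_UNKNOWN
        || dr.misalignment == DR_MISALIGNMENT_UNKNOWN)
      dr = fresh;             /* the fresh result is at least as good */
  }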

The ChangeLog lists the functional changes but not the bulk of the
mechanical alignment accessor API changes - I hope that's acceptable.

2021-09-17  Richard Biener  

PR tree-optimization/97351
PR tree-optimization/97352
PR tree-optimization/82426
* tree-vectorizer.h (dr_misalignment): Add vector type
argument.
(aligned_access_p): Likewise.
(known_alignment_for_access_p): Likewise.
(vect_supportable_dr_alignment): Likewise.
(vect_known_alignment_in_bytes): Likewise.  Refactor.
(DR_MISALIGNMENT): Remove.
(vect_update_shared_vectype): Likewise.
* tree-vect-data-refs.c (dr_misalignment): Refactor, handle
a vector type with larger alignment requirement and apply
the negative step adjustment here.
(vect_calculate_target_alignment): Remove.
(vect_compute_data_ref_alignment): Get explicit vector type
argument, do not apply a negative step alignment adjustment
here.
(vect_slp_analyze_node_alignment): Re-analyze alignment
when we re-visit the DR with a bigger desired alignment but
keep more precise results from smaller alignments.
* tree-vect-slp.c (vect_update_shared_vectype): Remove.
(vect_slp_analyze_node_operations_1): Do not update the
shared vector type on stmts.
* tree-vect-stmts.c (vect_analyze_stmt): Push/pop the
vector type of an SLP node to the representative stmt-info.
(vect_transform_stmt): Likewise.

* gcc.target/i386/vect-pr82426.c: New testcase.
* gcc.target/i386/vect-pr97352.c: Likewise.

[Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3

2021-09-20 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

Richard Biener  changed:

           What    |Removed                       |Added
 ---------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
             Status|NEW                           |ASSIGNED

--- Comment #6 from Richard Biener  ---
I have a patch that produces

  vect__1.5_42 = MEM  [(float *)a_34(D)];
  vect__1.7_47 = VEC_PERM_EXPR ;
  vect__2.10_49 = MEM  [(float *)b_35(D)];
  vect__2.12_53 = VEC_PERM_EXPR ;
  vect__3.13_54 = vect__1.7_47 * vect__2.12_53;
  vect__2.30_73 = MEM  [(float *)b_35(D)];
  vect__1.18_61 = VEC_PERM_EXPR ;
  vect__2.23_68 = VEC_PERM_EXPR ;
  vect__6.24_69 = vect__1.18_61 * vect__2.23_68;
  vect__7.25_70 = vect__3.13_54 + vect__6.24_69;
  vect__5.40_85 = MEM  [(float *)b_35(D) + 8B];
  MEM  [(float *)&] = vect__7.25_70;
  vect__21.35_81 = MEM  [(float *)a_34(D) + 16B];
  vect__1.36_82 = VEC_PERM_EXPR ;
  vect__22.37_83 = vect__2.30_73 * vect__1.36_82;
  vect__1.46_94 = VEC_PERM_EXPR ;
  vect__24.47_95 = vect__5.40_85 * vect__1.46_94;
  vect__25.48_96 = vect__22.37_83 + vect__24.47_95;
  vect__26.51_98 = MEM  [(float *)b_35(D) + 16B];
  vect__27.52_100 = vect__25.48_96 + vect__26.51_98;
  MEM  [(float *)& + 16B] = vect__27.52_100;

That means it ends up with some odd vector loads, but with SSE 4.2 it becomes:

movups  (%rsi), %xmm5
movups  (%rdx), %xmm1
movq    %rdi, %rax
movq    (%rdx), %xmm4
movq    8(%rdx), %xmm3
movsldup        %xmm5, %xmm0
movaps  %xmm1, %xmm2
movlhps %xmm1, %xmm2
shufps  $238, %xmm1, %xmm1
mulps   %xmm0, %xmm2
movshdup        %xmm5, %xmm0
mulps   %xmm1, %xmm0
movq    16(%rsi), %xmm1
addps   %xmm2, %xmm0
movups  %xmm0, (%rdi)
movsldup        %xmm1, %xmm0
movshdup        %xmm1, %xmm1
mulps   %xmm4, %xmm0
mulps   %xmm3, %xmm1
addps   %xmm1, %xmm0
movq    16(%rdx), %xmm1
addps   %xmm1, %xmm0
movlps  %xmm0, 16(%rdi)

Alternatively, with -mavx some of the required perms can be folded into
the loads, and with -mfma we can use FMAs as well:

vpermilps   $238, (%rdx), %xmm1
vpermilps   $245, (%rsi), %xmm0
movq    %rdi, %rax
vpermilps   $160, (%rsi), %xmm3
vpermilps   $68, (%rdx), %xmm4
vmulps  %xmm1, %xmm0, %xmm0
vmovq   (%rdx), %xmm2
vfmadd231ps %xmm4, %xmm3, %xmm0
vmovq   8(%rdx), %xmm3
vmovups %xmm0, (%rdi)
vmovq   16(%rsi), %xmm0
vmovsldup   %xmm0, %xmm1
vmovshdup   %xmm0, %xmm0
vmulps  %xmm3, %xmm0, %xmm0
vfmadd132ps %xmm1, %xmm0, %xmm2
vmovq   16(%rdx), %xmm0
vaddps  %xmm2, %xmm0, %xmm0
vmovlps %xmm0, 16(%rdi)

I'm not sure whether the vmovups + vmovs{l,h}dup are any better than
doing two scalar loads + dups though - at least it might avoid some
store-to-load forwarding (STLF) conflicts with earlier smaller stores.

[Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3

2021-08-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

Richard Biener  changed:

           What    |Removed |Added
 ----------------------------------------------------------------
             Target|        |x86_64-*-*

--- Comment #5 from Richard Biener  ---
x86 actually does have V2SF; the issue is that there is an opportunity
for V4SF vectorization and one for V2SF arriving at the same load groups,
and that causes a conflict (there are other PRs about this general
issue), so we kill one part:

t.C:18:12: missed:   desired vector type conflicts with earlier one for
_2 = b_35(D)->m11;
t.C:18:12: note:  removing SLP instance operations starting from:
.dx = _27;

Also, we have a bunch of live lanes off the remaining vectorized piece,
which makes the code a bit awkward.

Unfortunately we have no way to force 64-bit vectors (V2SF) here to see
whether splitting up the V4SFmode partition would help (I guess it
would, as can be seen from using 'double').
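
For concreteness, here is a testcase in the spirit of this PR - a
hypothetical reconstruction, since the source is only attached to the
bug and not quoted in this thread; the member names m11/dx merely
follow the diagnostics above.  The four m* results form a natural
4-lane (V4SF) group while dx/dy form a 2-lane (V2SF) group over the
same loads:

  /* Hypothetical 2D affine transform multiply; sketch only.  */
  struct Xform { float m11, m12, m21, m22, dx, dy; };

  Xform
  multiply (const Xform &a, const Xform &b)
  {
    Xform r;
    r.m11 = a.m11 * b.m11 + a.m12 * b.m21;       /* 4-lane group...  */
    r.m12 = a.m11 * b.m12 + a.m12 * b.m22;
    r.m21 = a.m21 * b.m11 + a.m22 * b.m21;
    r.m22 = a.m21 * b.m12 + a.m22 * b.m22;
    r.dx  = a.dx * b.m11 + a.dy * b.m21 + b.dx;  /* ...2-lane group */
    r.dy  = a.dx * b.m12 + a.dy * b.m22 + b.dy;
    return r;
  }

At the source level one could approximate the missing V2SF split by
hand with GCC's generic vector extensions, e.g.
typedef float v2sf __attribute__ ((vector_size (8))); - but that is a
workaround in the testcase, not a way to steer the vectorizer.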

[Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3

2021-08-24 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

Andrew Pinski  changed:

           What    |Removed     |Added
 ----------------------------------------------------------------
     Ever confirmed|0           |1
             Status|UNCONFIRMED |NEW
           Severity|normal      |enhancement
   Last reconfirmed|            |2021-08-25

--- Comment #4 from Andrew Pinski  ---
Hmm, on aarch64 we do a decent job of vectorizing this (since GCC 11):
ldp     d4, d0, [x1]
ldr     d7, [x0, 16]
ldp     d6, d5, [x0]
fmul    v3.2s, v0.2s, v7.s[1]
ldr     d1, [x1, 16]
fmul    v2.2s, v0.2s, v6.s[1]
fmul    v0.2s, v0.2s, v5.s[1]
fmla    v3.2s, v4.2s, v7.s[0]
fmla    v2.2s, v4.2s, v6.s[0]
fmla    v0.2s, v4.2s, v5.s[0]
fadd    v1.2s, v1.2s, v3.2s
stp     d2, d0, [x8]
str     d1, [x8, 16]

I suspect this is because V2SF does not exist on x86_64.
Using -Dfloat=double seems to give better results on x86_64 (with -mavx2):
vmovupd (%rdx), %ymm0
vpermilpd   $0, (%rsi), %ymm1
movq    %rdi, %rax
vmovsd  32(%rsi), %xmm5
vmovsd  40(%rsi), %xmm4
vpermpd $68, %ymm0, %ymm2
vpermpd $238, %ymm0, %ymm3
vmulpd  %ymm2, %ymm1, %ymm2
vpermilpd   $15, (%rsi), %ymm1
vmulpd  %ymm3, %ymm1, %ymm1
vaddpd  %ymm1, %ymm2, %ymm1
vmulsd  %xmm5, %xmm0, %xmm2
vmovupd %ymm1, (%rdi)
vmovapd %xmm0, %xmm1
vextractf128    $0x1, %ymm0, %xmm0
vmulsd  %xmm4, %xmm0, %xmm3
vunpckhpd   %xmm1, %xmm1, %xmm1
vunpckhpd   %xmm0, %xmm0, %xmm0
vmulsd  %xmm5, %xmm1, %xmm1
vmulsd  %xmm4, %xmm0, %xmm0
vaddsd  %xmm3, %xmm2, %xmm2
vaddsd  32(%rdx), %xmm2, %xmm2
vaddsd  %xmm0, %xmm1, %xmm1
vaddsd  40(%rdx), %xmm1, %xmm1
vmovsd  %xmm2, 32(%rdi)
vmovsd  %xmm1, 40(%rdi)

[Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3

2017-10-04 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

--- Comment #3 from Allan Jensen  ---
Note that it appears the fact it can do this at all at -Os is new in GCC 7.

[Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3

2017-10-04 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

--- Comment #2 from Allan Jensen  ---
Created attachment 42301
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42301&action=edit
Assembler output with -Os -ftree-slp-vectorize

[Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3

2017-10-04 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

--- Comment #1 from Allan Jensen  ---
Created attachment 42300
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42300&action=edit
Assembler output with -O3