[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

2023-06-07 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

--- Comment #15 from rguenther at suse dot de  ---
On Wed, 7 Jun 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
> 
> --- Comment #14 from Hongtao.liu  ---
> (In reply to Richard Biener from comment #13)
> > The target now has the ability to tell the vectorizer to choose a larger VF
> > based on the cost info it got for the default VF, so the x86 backend could
> > make use of that.  For example with the following patch we'll unroll the
> > vectorized loops 4 times (of course the actual check for small reduction
> > loops and a register pressure estimate is missing).  That generates
> > 
> > .L4:
> > vaddps  (%rax), %zmm1, %zmm1
> > vaddps  64(%rax), %zmm2, %zmm2
> > addq$256, %rax
> > vaddps  -128(%rax), %zmm0, %zmm0
> > vaddps  -64(%rax), %zmm3, %zmm3
> > cmpq%rcx, %rax
> > jne .L4
> > movq%rdx, %rax
> > andq$-64, %rax
> > vaddps  %zmm3, %zmm0, %zmm0
> > vaddps  %zmm2, %zmm1, %zmm1
> > vaddps  %zmm1, %zmm0, %zmm1
> > ... more epilog ...
> > 
> > with -march=znver4 on current trunk.
> > 
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d4ff56ee8dd..53c09bb9d9c 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -23615,8 +23615,18 @@ class ix86_vector_costs : public vector_costs
> >   stmt_vec_info stmt_info, slp_tree node,
> >   tree vectype, int misalign,
> >   vect_cost_model_location where) override;
> > +  void finish_cost (const vector_costs *uncast_scalar_costs);
> >  };
> >  
> > +void
> > +ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
> > +{
> > +  auto *scalar_costs
> > += static_cast (uncast_scalar_costs);
> > +  m_suggested_unroll_factor = 4;
> > +  vector_costs::finish_cost (scalar_costs);
> 
> I remember we have posted an patch for that
> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604186.html
> 
> One regression observed is the VF of epilog loop will increase(from xmm to 
> ymm)
> after unroll the vectorized loops, and it regressed performance for
> lower-tripcount loop(similar as -mprefer-vector-width=512).

Ah, yeah.  We could resort to check estimated_number_of_iterations
to guide us with profile feedback.  I'm also (again) working on
fully masked epilogues which should reduce the impact on low-trip
count loops.

> Also for the case in the PR, I'm trying to enable
> -fvariable-expansion-in-unroller when -funroll-loops, and the partial sum will
> break reduction chain.

Probably also a good idea - maybe -fvariable-expansion-in-unroller can
be made smarter and guided by register pressure?

[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

2023-06-07 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

--- Comment #14 from Hongtao.liu  ---
(In reply to Richard Biener from comment #13)
> The target now has the ability to tell the vectorizer to choose a larger VF
> based on the cost info it got for the default VF, so the x86 backend could
> make use of that.  For example with the following patch we'll unroll the
> vectorized loops 4 times (of course the actual check for small reduction
> loops and a register pressure estimate is missing).  That generates
> 
> .L4:
> vaddps  (%rax), %zmm1, %zmm1
> vaddps  64(%rax), %zmm2, %zmm2
> addq$256, %rax
> vaddps  -128(%rax), %zmm0, %zmm0
> vaddps  -64(%rax), %zmm3, %zmm3
> cmpq%rcx, %rax
> jne .L4
> movq%rdx, %rax
> andq$-64, %rax
> vaddps  %zmm3, %zmm0, %zmm0
> vaddps  %zmm2, %zmm1, %zmm1
> vaddps  %zmm1, %zmm0, %zmm1
> ... more epilog ...
> 
> with -march=znver4 on current trunk.
> 
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d4ff56ee8dd..53c09bb9d9c 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -23615,8 +23615,18 @@ class ix86_vector_costs : public vector_costs
>   stmt_vec_info stmt_info, slp_tree node,
>   tree vectype, int misalign,
>   vect_cost_model_location where) override;
> +  void finish_cost (const vector_costs *uncast_scalar_costs);
>  };
>  
> +void
> +ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
> +{
> +  auto *scalar_costs
> += static_cast (uncast_scalar_costs);
> +  m_suggested_unroll_factor = 4;
> +  vector_costs::finish_cost (scalar_costs);

I remember we have posted an patch for that
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604186.html

One regression observed is the VF of epilog loop will increase(from xmm to ymm)
after unroll the vectorized loops, and it regressed performance for
lower-tripcount loop(similar as -mprefer-vector-width=512).

Also for the case in the PR, I'm trying to enable
-fvariable-expansion-in-unroller when -funroll-loops, and the partial sum will
break reduction chain.

[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

2023-06-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

Richard Biener  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #13 from Richard Biener  ---
The target now has the ability to tell the vectorizer to choose a larger VF
based on the cost info it got for the default VF, so the x86 backend could
make use of that.  For example with the following patch we'll unroll the
vectorized loops 4 times (of course the actual check for small reduction
loops and a register pressure estimate is missing).  That generates

.L4:
vaddps  (%rax), %zmm1, %zmm1
vaddps  64(%rax), %zmm2, %zmm2
addq$256, %rax
vaddps  -128(%rax), %zmm0, %zmm0
vaddps  -64(%rax), %zmm3, %zmm3
cmpq%rcx, %rax
jne .L4
movq%rdx, %rax
andq$-64, %rax
vaddps  %zmm3, %zmm0, %zmm0
vaddps  %zmm2, %zmm1, %zmm1
vaddps  %zmm1, %zmm0, %zmm1
... more epilog ...

with -march=znver4 on current trunk.

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index d4ff56ee8dd..53c09bb9d9c 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23615,8 +23615,18 @@ class ix86_vector_costs : public vector_costs
  stmt_vec_info stmt_info, slp_tree node,
  tree vectype, int misalign,
  vect_cost_model_location where) override;
+  void finish_cost (const vector_costs *uncast_scalar_costs);
 };

+void
+ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
+{
+  auto *scalar_costs
+= static_cast (uncast_scalar_costs);
+  m_suggested_unroll_factor = 4;
+  vector_costs::finish_cost (scalar_costs);
+}
+
 /* Implement targetm.vectorize.create_costs.  */

 static vector_costs *

[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

2023-06-06 Thread drraph at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

Raphael C  changed:

   What|Removed |Added

 CC||drraph at gmail dot com

--- Comment #12 from Raphael C  ---
This problem has been recently discussed at:

https://stackoverflow.com/questions/76407241/why-is-cython-so-much-slower-than-numba-for-this-simple-loop

[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

2020-10-11 Thread freddie at witherden dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

Freddie Witherden  changed:

   What|Removed |Added

 CC||freddie at witherden dot org

--- Comment #11 from Freddie Witherden  ---
I've been looking into this and the big difference appears to be that when
Clang unrolls the loop it does so using multiple accumulators (and indeed does
this without need to be told to unroll.  Given:

double acc(double *x, int n)
{
double a = 0;
#pragma omp simd
for (int i = 0; i < n; i++)
a += x[i];
return a;
}


and compiling with clang -march=native -Ofast -fopenmp -S the core loop reads
as:

vaddpd  (%rdi,%rsi,8), %ymm0, %ymm0
vaddpd  32(%rdi,%rsi,8), %ymm1, %ymm1
vaddpd  64(%rdi,%rsi,8), %ymm2, %ymm2
vaddpd  96(%rdi,%rsi,8), %ymm3, %ymm3
vaddpd  128(%rdi,%rsi,8), %ymm0, %ymm0
vaddpd  160(%rdi,%rsi,8), %ymm1, %ymm1
vaddpd  192(%rdi,%rsi,8), %ymm2, %ymm2
vaddpd  224(%rdi,%rsi,8), %ymm3, %ymm3
vaddpd  256(%rdi,%rsi,8), %ymm0, %ymm0
vaddpd  288(%rdi,%rsi,8), %ymm1, %ymm1
vaddpd  320(%rdi,%rsi,8), %ymm2, %ymm2
vaddpd  352(%rdi,%rsi,8), %ymm3, %ymm3
vaddpd  384(%rdi,%rsi,8), %ymm0, %ymm0
vaddpd  416(%rdi,%rsi,8), %ymm1, %ymm1
vaddpd  448(%rdi,%rsi,8), %ymm2, %ymm2
vaddpd  480(%rdi,%rsi,8), %ymm3, %ymm3

which is heavily unrolled and uses four separate accumulators to hide the
latency of the vector adds.  Interestingly, one could argue that Clang is not
using enough registers given that Skylake can dual-issue adds and they have a
latency of 4 cycles (implying you want 8 separate accumulators).

GCC 10 with gcc -march=skylake -Ofast -fopenmp -S test.c -funroll-loops

vaddpd  -224(%r8), %ymm1, %ymm2
vaddpd  -192(%r8), %ymm2, %ymm3
vaddpd  -160(%r8), %ymm3, %ymm4
vaddpd  -128(%r8), %ymm4, %ymm5
vaddpd  -96(%r8), %ymm5, %ymm6
vaddpd  -64(%r8), %ymm6, %ymm7
vaddpd  -32(%r8), %ymm7, %ymm0

which although it is unrolled, is not a useful unrolling due to the dependency
chain.  Indeed, I would not be surprised if the performance is similar to the
unrolled code as the loop related cruft can be hidden.

[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

2016-06-10 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

--- Comment #10 from rguenther at suse dot de  ---
On June 10, 2016 5:22:39 PM GMT+02:00, "jamborm at gcc dot gnu.org"
 wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
>
>--- Comment #9 from Martin Jambor  ---
>(In reply to Richard Biener from comment #5)
>> An interesting observation is that we clone sum32 for IPA-CP of n ==
>1024 but
>> for some unknown reason figure Alignment of 'a' as unusable:
>
>The function sum32 is not static, so it can be called from outside of
>the current compilation unit with any alignment.  IPA-CP does not
>clone for alignment, which was my deliberate decision because I found
>it very difficult to reason about its profitability (I am opened to
>suggestions in this area, of course).

It ends up cloning for the constant second arg though so I'd have expected the
alignment to be set on the first...

>Make it static or compile with -flto and you will get:
>
>IPA lattices after all propagation:
>
>Lattices:
>  Node: main/35:
>  Node: sum32/34:
>param [0]: VARIABLE
> ctxs: VARIABLE
> Alignment 64, misalignment 0
>AGGS VARIABLE
>param [1]: 1024 [from: 35(99000)] [loc_time: 0, loc_size: 0, prop_time:
>0,
>prop_size: 
>0]
> ctxs: VARIABLE
> Alignment unusable (BOTTOM)
>AGGS VARIABLE

[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

2016-06-10 Thread jamborm at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

--- Comment #9 from Martin Jambor  ---
(In reply to Richard Biener from comment #5)
> An interesting observation is that we clone sum32 for IPA-CP of n == 1024 but
> for some unknown reason figure Alignment of 'a' as unusable:

The function sum32 is not static, so it can be called from outside of
the current compilation unit with any alignment.  IPA-CP does not
clone for alignment, which was my deliberate decision because I found
it very difficult to reason about its profitability (I am opened to
suggestions in this area, of course).

Make it static or compile with -flto and you will get:

IPA lattices after all propagation:

Lattices:
  Node: main/35:
  Node: sum32/34:
param [0]: VARIABLE
 ctxs: VARIABLE
 Alignment 64, misalignment 0
AGGS VARIABLE
param [1]: 1024 [from: 35(99000)] [loc_time: 0, loc_size: 0, prop_time: 0,
prop_size: 
0]
 ctxs: VARIABLE
 Alignment unusable (BOTTOM)
AGGS VARIABLE

[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

2016-06-08 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

--- Comment #8 from rguenther at suse dot de  ---
On Tue, 7 Jun 2016, yyc1992 at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
> 
> --- Comment #7 from Yichao Yu  ---
> If I add `-fvariable-expansion-in-unroller` (omg this options is like half the
> command line ;-p ...), the performance matches the clang one after the clang
> 3.8 regression.
> 
> ```
> % gcc -funroll-loops -fvariable-expansion-in-unroller -Ofast -march=core-avx2
> benchmark.c -o benchmark2 
> % ./benchmark2 
> 45.588861
> % ./benchmark-gcc
> 80.518152
> % ./benchmark-clang38
> 41.920054
> % ./benchmark-clang37
> 25.093145
> ```

Yeah, but -fvariable-expansion-in-unroller is quite late.

[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

2016-06-07 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

--- Comment #7 from Yichao Yu  ---
If I add `-fvariable-expansion-in-unroller` (omg this options is like half the
command line ;-p ...), the performance matches the clang one after the clang
3.8 regression.

```
% gcc -funroll-loops -fvariable-expansion-in-unroller -Ofast -march=core-avx2
benchmark.c -o benchmark2 
% ./benchmark2 
45.588861
% ./benchmark-gcc
80.518152
% ./benchmark-clang38
41.920054
% ./benchmark-clang37
25.093145
```

[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

2016-06-07 Thread Joost.VandeVondele at mat dot ethz.ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

Joost VandeVondele  changed:

   What|Removed |Added

 CC||Joost.VandeVondele at mat dot 
ethz
   ||.ch

--- Comment #6 from Joost VandeVondele  
---
Isn't this a case where -fvariable-expansion-in-unroller is helpful ?

> gcc -Ofast t.c -lrt ; ./a.out
285.670206

> gcc -Ofast -funroll-loops -fvariable-expansion-in-unroller  t.c -lrt ; ./a.out
151.246083

> gcc -Ofast -funroll-loops  t.c -lrt ; ./a.out
277.047507

There is some relation with PR25621 I think.

[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

2016-06-07 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

Richard Biener  changed:

   What|Removed |Added

 Target||x86_64-*-*
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2016-06-07
 CC||jamborm at gcc dot gnu.org,
   ||mliska at suse dot cz
  Component|other   |tree-optimization
 Blocks||53947
Summary|2x slower than clang|2x slower than clang
   |summing small float array   |summing small float array,
   ||GCC should consider larger
   ||vectorization factor for
   ||"unrolling" reductions
 Ever confirmed|0   |1

--- Comment #5 from Richard Biener  ---
An interesting observation is that we clone sum32 for IPA-CP of n == 1024 but
for some unknown reason figure Alignment of 'a' as unusable:

Lattices:
  Node: main/35:
  Node: sum32/34:
param [0]: VARIABLE
 ctxs: VARIABLE
 Alignment unusable (BOTTOM)
AGGS VARIABLE
param [1]: VARIABLE
   1024 [from: 35(99000)] [loc_time: 65, loc_size: 10, prop_time:
0, prop_size: 0]
 ctxs: VARIABLE
 Alignment unusable (BOTTOM)
AGGS VARIABLE

Evaluating opportunities for sum32/34.
 - considering value 1024 for param #1 n (caller_count: 1)
 good_cloning_opportunity_p (time: 65, size: 10, freq_sum: 99000) ->
evaluation: 643500, threshold: 500
  Creating a specialized node of sum32/34.
replacing param #1 n with const 1024
Accounting size:7.00, time:72.78 on predicate:(true)
Accounting size:3.00, time:2.00 on new predicate:(not inlined)
 the new node is sum32.constprop/43.


iff LLVM disables IPA CP cloning with 'noinline' the testcase should add
'noclone' as well to be a fair comparison.

The vectorizer decides to peel the loop for alignment (as usual...) and thus
creates both prologue and epilogue loop.  That shouldn't matter in practice
but it likely obfuscates code enough to make the Job for IVOPTs harder.
If the desire was to have nothing known about alignment and 'n' in sum32
the above cannot be avoided anyway.  We also peel both prologue and epilogue
loop.

clang 3.6 (the one I have locally) doesn't peel for alignment and thus uses
unaligned loads and unrolls the loop by 2 only.  It doesn't do any IPA CP
with -Ofast.

Note that the difference WRT clangs unrolling and GCCs unrolling is that
clang uses two accumulators while GCC just processes multiple loads on
the same accumulator with its unrolling.  Thus clang can exploit parallelism
in the pipeline of the CPU while GCC restricts the CPU due to the dependences.

This means that it is the vectorizer that needs to consider using a larger
vectorization factor rather than post-vectorization unrolling (that's likely
to pay off only for reductions).  I wonder what LLVMs heuristic is here.

The IPA-CP alignment question remains though.  Martins?


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations