[Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling

2021-08-14 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

Andrew Pinski  changed:

              What|Removed |Added

  Target Milestone|---     |6.0
        Resolution|---     |FIXED
            Status|NEW     |RESOLVED
     Known to work|        |6.1.0
          See Also|        |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002

--- Comment #13 from Andrew Pinski  ---
This is fixed in GCC 6, we produce:

  _76 = MIN_EXPR ;
  offset_75 = MAX_EXPR <_76, 1>;
...

  _104 = offset_75 + -1;

Rather than:

  offset_30 = _52 <= 254 ? offset_67 : 1;
  prephitmp_119 = _52 <= 254 ? pretmp_118 : 0;
  _17 = offset_67 <= 255;
  offset_69 = _17 ? offset_30 : 255;
  prephitmp_109 = _17 ? prephitmp_119 : 254;

This was fixed by r6-528.  PR 66002 describes almost the same issue.
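For readers less used to GIMPLE dumps, the difference can be mirrored at the
source level. The sketch below is not the test case from this PR:
clamp_select and clamp_minmax are hypothetical helpers, and the bounds 1 and
255 are chosen only to echo the constants visible in the dumps. It illustrates
that a chain of conditional selects and a MIN/MAX pair can compute the same
clamp, which is roughly the simplification the newer dump reflects.

```c
/* Hypothetical helper: clamp x to [1, 255] with two conditional selects,
   similar in spirit to the COND_EXPR chain in the older dump. */
int clamp_select(int x)
{
    int lo = (x >= 1) ? x : 1;      /* raise values below 1 */
    return (x <= 255) ? lo : 255;   /* cap values above 255 */
}

/* Hypothetical helper: the same clamp written as a minimum followed by a
   maximum, similar in spirit to the MIN_EXPR/MAX_EXPR pair. */
int clamp_minmax(int x)
{
    int m = (x < 255) ? x : 255;    /* min(x, 255) */
    return (m > 1) ? m : 1;         /* max(m, 1)   */
}
```

Whether a given compiler recognizes the first form as a min/max pair depends
on its version and flags; the sketch only shows that the two are equivalent.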

[Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling

2015-03-31 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #12 from Allan Jensen linux at carewolf dot com ---
I have a very crude fix for this.

First, though: according to comments in tree-if-conv.c and earlier bugs on the
issue, if-conversion is supposed to be conditional. It is performed in a piece
of conditional code that is only to be used if vectorized. For some reason this
version appears to be used anyway.

Secondly, if conditional move instructions are generally slower than
branches, shouldn't they be avoided during instruction selection? The crude
fix is simply placing a 'return false;' at the top of ix86_expand_int_movcc in
i386.c.

So this case somehow triggers a case where the if-conversion that is supposed
to only be used by vectorization gets used anyway, but more generally, i386
shouldn't be generating cmov instructions for conditional moves in the first
place for modern architectures (anything newer than core2 and bulldozer). At
least not without input from a profile run.
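To make the transformation under discussion concrete, here is a minimal
source-level sketch of what if-conversion does to a loop body. The function
names, the threshold parameter and the constant 255 are assumptions made for
illustration; they are not taken from the PR's test case. The first loop has
a conditional branch per element, the second expresses the same result as a
select, which a back end may (or may not) lower to a cmov.

```c
#include <stddef.h>

/* Branchy form: control flow decides which value is stored. */
void threshold_branch(const int *in, int *out, size_t n, int t)
{
    for (size_t i = 0; i < n; i++) {
        if (in[i] > t)
            out[i] = 255;
        else
            out[i] = in[i];
    }
}

/* If-converted form: both values are available and one is selected.
   This shape is what a vectorizer wants, but when the loop ends up scalar
   the select can be slower than a well-predicted branch. */
void threshold_select(const int *in, int *out, size_t n, int t)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (in[i] > t) ? 255 : in[i];
}
```

How each form is actually compiled depends on the compiler version, target
and flags, so the generated assembly should be inspected rather than assumed.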


[Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling

2015-03-24 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #11 from Allan Jensen linux at carewolf dot com ---
Issues with slow cmov have been seen in several bug reports:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53346
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54073
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309


[Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling

2015-03-21 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #10 from Allan Jensen linux at carewolf dot com ---
Just to make things more complicated, I just tried the test on a Haswell, and
surprisingly disabling if-convert or tree-vectorize on -O3 has no effect on
performance, but activating tree-vectorize on -O2 does.

In conclusion: this test is slower in -O3 than -O2 on all tested CPUs (Phenom,
SandyBridge and Haswell), but for different reasons.

On Phenom, it is slower due to if-convert, but not unroll (unrolled might even
be slightly faster, but only by a small amount).
On SandyBridge, it is slower due to both if-convert and unroll, and even slower
when both are active.
On Haswell, it is slower due to both if-convert and unroll, but if-convert on
top of unroll is no slower than unroll on its own.

In general it is probably safe to try to avoid or undo the if-convert. There
appears to be special if-conversions only performed when vectorization is
active. Presumably they are only used in that case because they are known to
likely be slower when the loop is not vectorized. In this case the
if-conversion is done, but the loop is not vectorized in the end, just slowing
it down (on non-Haswell).

The unroll issue could perhaps be handled by controlling some optimization
params with tuning profiles. Where is trivial unrolling like this even
performed?
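A comparison like the one above can be reproduced with a small timing harness.
The sketch below is an assumption-laden stand-in, not the reduced test case
from this PR: lookup_sum, time_lookup_sum, the 256-entry table and the file
name bench.c are all hypothetical. It only mirrors the general shape described
in this thread (a lookup table guarded by range checks in the inner loop).

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Possible builds to compare (bench.c is a hypothetical file name, and the
   max-peel-branches value of 8 is an arbitrary example):
     gcc -O2 bench.c
     gcc -O3 bench.c
     gcc -O3 -fno-tree-loop-if-convert bench.c
     gcc -O3 -fno-tree-vectorize bench.c
     gcc -O3 --param max-peel-branches=8 bench.c                        */

/* Hypothetical kernel: clamp each index to the table range, then sum the
   table entries. `table` must have at least 256 elements. */
uint32_t lookup_sum(const uint8_t *table, const int *idx, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int k = idx[i];
        if (k < 0)
            k = 0;
        if (k > 255)
            k = 255;
        sum += table[k];
    }
    return sum;
}

/* Runs the kernel `reps` times and returns the CPU time in seconds.
   The accumulated sum is stored in *out so the result is observable. */
double time_lookup_sum(const uint8_t *table, const int *idx, size_t n,
                       int reps, uint32_t *out)
{
    clock_t start = clock();
    uint32_t sum = 0;
    for (int r = 0; r < reps; r++)
        sum += lookup_sum(table, idx, n);
    *out = sum;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Timings from such a harness are only meaningful with enough repetitions and
after checking the generated assembly, since the compiler is free to inline
or hoist the kernel.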


[Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling

2015-03-20 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #8 from Allan Jensen linux at carewolf dot com ---
You can remove the branches in the inner loop and still reproduce the issue.
There were no branches in the original code, I only added them to the reduced
case because I was using a smaller lookup table.

It appears that after removing the branches, the execution time with and without
-fno-tree-vectorize on -O3 is the same. So they also cause some issue, but is
the main one.


[Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling

2015-03-20 Thread hubicka at ucw dot cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #7 from Jan Hubicka hubicka at ucw dot cz ---
 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492
 
 Richard Biener rguenth at gcc dot gnu.org changed:
 
What|Removed |Added
 
  CC||hubicka at gcc dot gnu.org
 
 --- Comment #6 from Richard Biener rguenth at gcc dot gnu.org ---
 --param max-peel-branches default of 32 seems to be quite high.  For this
 loop we have two branches on the hot path and 4 times unrolling.
 
 Honza - how did you arrive at the default of 32?  Shouldn't that depend
 on the number of other stmts thus rather look at branch density?

In https://gcc.gnu.org/ml/gcc-patches/2012-10/msg02716.html I claim a value
around 32 is needed for apply (not that I would recall that).

I do not have a really strong opinion concerning the branch density.

Honza


[Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling

2015-03-20 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #9 from Allan Jensen linux at carewolf dot com ---
Looking at the assembler, it does indeed appear that the only differences are
loop unrolling and if-conversion.

After testing on another machine (an old PhenomII as opposed to the
SandyBridge), I can report that disabling tree-loop-if-convert, directly or
indirectly via tree-loop-vectorize, on -O3 regains all of the speed difference
to -O2 on PhenomII.

My guess is that the small loop-unrolling is conflicting with the op-cache Intel
introduced in the SandyBridge and newer architectures which speeds up small
tight loops. On architectures without op-cache the loop-unrolling is probably
still slightly faster.

Unfortunately, using -mtune=sandybridge does not improve the situation, so
maybe there should be some architecture tuning on even trivial loop unrolling,
and possibly discussion on making it part of generic-x64 tuning.