[Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

Andrew Pinski changed:

           What    |Removed |Added
   Target Milestone|---     |6.0
         Resolution|---     |FIXED
             Status|NEW     |RESOLVED
      Known to work|        |6.1.0
           See Also|        |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66002

--- Comment #13 from Andrew Pinski ---
This is fixed in GCC 6; we now produce:

  _76 = MIN_EXPR ;
  offset_75 = MAX_EXPR <_76, 1>;
  ...
  _104 = offset_75 + -1;

rather than:

  offset_30 = _52 <= 254 ? offset_67 : 1;
  prephitmp_119 = _52 <= 254 ? pretmp_118 : 0;
  _17 = offset_67 <= 255;
  offset_69 = _17 ? offset_30 : 255;
  prephitmp_109 = _17 ? prephitmp_119 : 254;

This was fixed by r6-528. PR 66002 describes almost the same issue.
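As a rough illustration of the two shapes in the dumps above, the following C sketch writes the same clamp once as a chain of conditional selects and once as a min/max pair. The function names and the bounds 1 and 255 are assumptions read off the constants visible in the dump; the operands of the MIN_EXPR are not shown in the comment, and this is not the test case from the bug.

```c
/* Hypothetical helpers, for illustration only. */
int min_int(int a, int b) { return a < b ? a : b; }
int max_int(int a, int b) { return a > b ? a : b; }

/* Clamp written as chained conditional selects, similar in shape to the
   "offset_30 = ... ? ... : 1; offset_69 = ... ? ... : 255" form. */
int clamp_select(int offset)
{
    int low_clamped = (offset >= 1) ? offset : 1;
    return (offset <= 255) ? low_clamped : 255;
}

/* The same clamp written as a MIN followed by a MAX, similar in shape to
   the MIN_EXPR/MAX_EXPR form (upper bound 255 is an assumption). */
int clamp_minmax(int offset)
{
    return max_int(min_int(offset, 255), 1);
}
```

Both functions return the same value for every input; the difference is only in how many dependent selects the compiler has to emit.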
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #12 from Allan Jensen linux at carewolf dot com ---
I have a very crude fix for this.

First, though: according to comments in tree-if-conv.c and earlier bugs on the issue, if-conversion is supposed to be conditional. It is performed on a piece of conditional code that is only to be used if the loop is vectorized. For some reason this version appears to be used anyway.

Secondly, if conditional move instructions are generally slower than branches, shouldn't they be avoided during instruction selection? The crude fix is simply placing a 'return false;' at the top of ix86_expand_int_movcc in i386.c.

So this case somehow triggers a situation where the if-conversion that is supposed to be used only by vectorization gets used anyway. More generally, i386 shouldn't be generating cmov instructions for conditional moves in the first place on modern architectures (anything newer than core2 and bulldozer), at least not without input from a profile run.
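For readers unfamiliar with the transformation being discussed, here is a minimal sketch of what if-conversion does at source level: a conditional update inside a loop is turned into an unconditional update fed by a select. The function names and the loop itself are hypothetical and are not taken from the test case attached to this bug; whether a compiler emits a branch or a cmov for either form depends on the target and tuning.

```c
#include <stddef.h>

/* Branchy form: the addition only happens on one side of a branch. */
unsigned sum_branchy(const unsigned char *in, size_t n, unsigned limit)
{
    unsigned acc = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < limit)
            acc += in[i];
    }
    return acc;
}

/* If-converted form: the condition becomes a select and the addition is
   unconditional, which suits vectorization but may end up as a
   conditional move when the loop stays scalar. */
unsigned sum_select(const unsigned char *in, size_t n, unsigned limit)
{
    unsigned acc = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned add = (in[i] < limit) ? in[i] : 0u;
        acc += add;
    }
    return acc;
}
```

The two functions compute the same result; only the control-flow shape differs, which is the property the comments in this thread are measuring.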
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #11 from Allan Jensen linux at carewolf dot com ---
Issues with slow cmov have been seen in several bug reports:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53346
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54073
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #10 from Allan Jensen linux at carewolf dot com ---
Just to make things more complicated: I tried the test on a Haswell, and surprisingly, disabling if-convert or tree-vectorize at -O3 has no effect on performance, but activating tree-vectorize at -O2 does.

In conclusion, this test is slower at -O3 than at -O2 on all tested CPUs (Phenom, SandyBridge and Haswell), but for different reasons. On Phenom, it is slower due to if-convert, but not unroll (unrolled might even be slightly faster, but only by a small amount). On SandyBridge, it is slower due to both if-convert and unroll, and even slower when both are active. On Haswell, it is slower due to both if-convert and unroll, but if-convert on top of unroll is no slower than unroll on its own.

In general it is probably safe to try to avoid or undo the if-convert. There appear to be special if-conversions that are only performed when vectorization is active. Presumably they are only used in that case because they are known to likely be slower when the loop is not vectorized. In this case the if-conversion is done, but the loop is not vectorized in the end, which just slows it down (except on Haswell).

The unroll issue could perhaps be handled by controlling some optimization params with tuning profiles. Where is trivial unrolling like this even performed?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #8 from Allan Jensen linux at carewolf dot com ---
You can remove the branches in the inner loop and still reproduce the issue. There were no branches in the original code; I only added them to the reduced case because I was using a smaller lookup table.

It appears that after removing the branches, the execution time with and without -fno-tree-vectorize at -O3 is the same. So they also cause some issue, but is the main one.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #7 from Jan Hubicka hubicka at ucw dot cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492
>
> Richard Biener rguenth at gcc dot gnu.org changed:
>
>            What    |Removed |Added
>                  CC|        |hubicka at gcc dot gnu.org
>
> --- Comment #6 from Richard Biener rguenth at gcc dot gnu.org ---
> --param max-peel-branches default of 32 seems to be quite high. For this
> loop we have two branches on the hot path and 4 times unrolling.
>
> Honza - how did you arrive at the default of 32? Shouldn't that depend
> on the number of other stmts thus rather look at branch density?

In https://gcc.gnu.org/ml/gcc-patches/2012-10/msg02716.html I claim value around 32 is needed for apply. (not that I would recall that)
I do not have really strong opinion concerning the branch density.

Honza
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492

--- Comment #9 from Allan Jensen linux at carewolf dot com ---
Looking at the assembler, it does indeed appear that the only difference is just loop unrolling and if-conversion.

After testing on another machine (an old Phenom II as opposed to the SandyBridge), I can report that disabling tree-loop-if-convert, directly or indirectly via tree-loop-vectorize, at -O3 regains all of the speed difference to -O2 on the Phenom II.

My guess is that the small loop unrolling conflicts with the op-cache Intel introduced in SandyBridge and newer architectures, which speeds up small tight loops. On architectures without an op-cache, the loop unrolling is probably still slightly faster.

Unfortunately, using -mtune=sandybridge does not improve the situation, so maybe there should be some architecture tuning even for trivial loop unrolling, and possibly a discussion on making it part of generic-x64 tuning.