[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Richard Biener changed:
           What    |Removed |Added
   Target Milestone|7.3     |7.4

--- Comment #11 from Richard Biener ---
GCC 7.3 is being released, adjusting target milestone.
[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

--- Comment #10 from Richard Biener ---
(In reply to Jeffrey A. Law from comment #9)
> So AFAICT there's two issues that need to be addressed: PRE and split-paths.
>
> First up is PRE.  Compile the sample code from c#5/c#6 with -O3
> -fno-split-paths.
>
> Prior to PRE we have:
>
>   if (_16 != 0)
>     goto <bb 5>; [50.00%]
>   else
>     goto <bb 4>; [50.00%]
>
>   <bb 4> [local count: 531502203]:
>
>   <bb 5> [local count: 1063004407]:
>   # iftmp.0_19 = PHI <2567483615(3), 0(4)>
>   _17 = _15 ^ iftmp.0_19;
>
> That's actually reasonably good.  While it's not a conditional move in the
> gimple, it's in a form that will be easy for the RTL optimizers to handle
> and generate a suitable cmov if we just left it alone on x86_64.
>
> PRE (correctly) identifies that it can reduce the number of expression
> evaluations on the path traversing bb3->bb5 by hoisting the XOR with the
> non-zero constant into BB4, resulting in:
>
>   if (_16 != 0)
>     goto <bb 4>; [50.00%]
>   else
>     goto <bb 5>; [50.00%]
>
>   <bb 4> [local count: 531502203]:
>   _52 = _15 ^ 2567483615;
>
>   <bb 5> [local count: 1063004407]:
>   # iftmp.0_19 = PHI <2567483615(4), 0(3)>
>   # prephitmp_53 = PHI <_52(4), _15(3)>
>
> That's correct, but far from ideal.
>
> So the second issue is split-paths.  There's actually two problems to deal
> with in split-paths.
>
> As it stands today this is what we see in split-paths (as a result of the
> PRE de-optimization):
>
>   [ ... ]
>   if (_20 != 0)
>     goto <bb 4>; [50.00%]
>   else
>     goto <bb 5>; [50.00%]
>
>   <bb 4> [local count: 531502203]:
>   _18 = _25 ^ 2567483615;
>
>   <bb 5> [local count: 1063004407]:
>   # prephitmp_49 = PHI <_25(3), _18(4)>
>   _2 = (void *) ivtmp.8_30;
>   MEM[base: _2, offset: 0B] = prephitmp_49;
>   ivtmp.8_29 = ivtmp.8_30 + 8;
>   if (ivtmp.8_29 != _6)
>     goto <bb 3>; [98.99%]
>   else
>     goto <bb 6>; [1.01%]
>
> split-paths should try not to muck it up further.  Note that we can
> probably identify this half-diamond pretty easily: bb3 dominates bb4, and
> bb4 has a single statement that feeds a PHI in bb5.  That's a very likely
> if-conversion candidate, so split-paths ought to leave it alone.
>
> If we were to fix PRE then split-paths would be presented with something
> like this:
>
>   [ ... ]
>   if (_47 != 0)
>     goto <bb 5>; [50.00%]
>   else
>     goto <bb 4>; [50.00%]
>
>   <bb 4> [local count: 531502203]:
>
>   <bb 5> [local count: 1063004407]:
>   # iftmp.0_48 = PHI <2567483615(3), 0(4)>
>   _49 = _18 ^ iftmp.0_48;
>
> ISTM that when either of the blocks in question (bb3/bb4) has *no*
> statements, with a single pred that is the other block, then split-paths
> definitely should leave it alone as well.
>
> So, to summarize:
>
> 1. PRE mucks things up a bit.
> 2. split-paths makes it worse.
>
> I've got a prototype patch that implements the two improvements to keep
> split-paths from making things worse.  That will improve things, but to
> really do a good job we'll have to either do something about PRE or have a
> pass after PRE undo PRE's deoptimization.

I don't think it's per se a PRE "deoptimization", it's simply what PRE is
supposed to do.  The result should be quite optimal given we should be able
to coalesce _52 and _15 and thus require no edge copies into bb 5.

  if (_16 != 0)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 4> [local count: 531502203]:
  _52 = _15 ^ 2567483615;

  <bb 5> [local count: 1063004407]:
  # prephitmp_53 = PHI <_52(4), _15(3)>

With a "reasonable" CPU we'd end up with sth like

  test r1, p1             // _16 != 0 into predicate reg p1
  xor r2, 2567483615, p1  // xor predicated with p1

Now of course path splitting ruins things here (I never ever liked that
pass; its very low-level transform is more suitable for RTL).  And if we'd
get conditional execution modeled "properly" in GIMPLE we could if-convert
there, preventing path splitting from mucking up things here...
[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Jeffrey A. Law changed:
           What    |Removed |Added
                 CC|        |rguenth at gcc dot gnu.org

--- Comment #9 from Jeffrey A. Law ---
So AFAICT there's two issues that need to be addressed: PRE and split-paths.

First up is PRE.  Compile the sample code from c#5/c#6 with -O3
-fno-split-paths.

Prior to PRE we have:

  if (_16 != 0)
    goto <bb 5>; [50.00%]
  else
    goto <bb 4>; [50.00%]

  <bb 4> [local count: 531502203]:

  <bb 5> [local count: 1063004407]:
  # iftmp.0_19 = PHI <2567483615(3), 0(4)>
  _17 = _15 ^ iftmp.0_19;

That's actually reasonably good.  While it's not a conditional move in the
gimple, it's in a form that will be easy for the RTL optimizers to handle
and generate a suitable cmov if we just left it alone on x86_64.

PRE (correctly) identifies that it can reduce the number of expression
evaluations on the path traversing bb3->bb5 by hoisting the XOR with the
non-zero constant into BB4, resulting in:

  if (_16 != 0)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 4> [local count: 531502203]:
  _52 = _15 ^ 2567483615;

  <bb 5> [local count: 1063004407]:
  # iftmp.0_19 = PHI <2567483615(4), 0(3)>
  # prephitmp_53 = PHI <_52(4), _15(3)>

That's correct, but far from ideal.

So the second issue is split-paths.  There's actually two problems to deal
with in split-paths.

As it stands today this is what we see in split-paths (as a result of the
PRE de-optimization):

  [ ... ]
  if (_20 != 0)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 4> [local count: 531502203]:
  _18 = _25 ^ 2567483615;

  <bb 5> [local count: 1063004407]:
  # prephitmp_49 = PHI <_25(3), _18(4)>
  _2 = (void *) ivtmp.8_30;
  MEM[base: _2, offset: 0B] = prephitmp_49;
  ivtmp.8_29 = ivtmp.8_30 + 8;
  if (ivtmp.8_29 != _6)
    goto <bb 3>; [98.99%]
  else
    goto <bb 6>; [1.01%]

split-paths should try not to muck it up further.  Note that we can probably
identify this half-diamond pretty easily: bb3 dominates bb4, and bb4 has a
single statement that feeds a PHI in bb5.  That's a very likely
if-conversion candidate, so split-paths ought to leave it alone.

If we were to fix PRE then split-paths would be presented with something
like this:

  [ ... ]
  if (_47 != 0)
    goto <bb 5>; [50.00%]
  else
    goto <bb 4>; [50.00%]

  <bb 4> [local count: 531502203]:

  <bb 5> [local count: 1063004407]:
  # iftmp.0_48 = PHI <2567483615(3), 0(4)>
  _49 = _18 ^ iftmp.0_48;

ISTM that when either of the blocks in question (bb3/bb4) has *no*
statements, with a single pred that is the other block, then split-paths
definitely should leave it alone as well.

So, to summarize:

1. PRE mucks things up a bit.
2. split-paths makes it worse.

I've got a prototype patch that implements the two improvements to keep
split-paths from making things worse.  That will improve things, but to
really do a good job we'll have to either do something about PRE or have a
pass after PRE undo PRE's deoptimization.
[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Jeffrey A. Law changed:
           What    |Removed |Added
                 CC|        |law at redhat dot com

--- Comment #8 from Jeffrey A. Law ---
I'd look more at split-paths rather than DOM in this case.  Prior to
split-paths we have:

  [ ... ]
  _20 = y_33 & 1;
  if (_20 != 0)
    goto <bb 4>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 4> [local count: 531502203]:
  _18 = _25 ^ 2567483615;

  <bb 5> [local count: 1063004407]:
  # prephitmp_49 = PHI <_25(3), _18(4)>
  _2 = (void *) ivtmp.8_30;
  [ ... ]

Split-paths comes along and decides it'd like to split BB5, resulting in:

  _20 = y_33 & 1;
  if (_20 != 0)
    goto <bb 4>; [50.00%]
  else
    goto <bb 6>; [50.00%]

  <bb 4> [local count: 531502203]:
  _18 = _25 ^ 2567483615;

  <bb 5> [local count: 531502203]:
  # prephitmp_43 = PHI <_18(4)>
  _47 = (void *) ivtmp.8_30;
  MEM[base: _47, offset: 0B] = prephitmp_43;
  ivtmp.8_42 = ivtmp.8_30 + 8;
  if (ivtmp.8_42 != _6)
    goto <bb 3>; [98.99%]
  else
    goto <bb 7>; [1.01%]

  <bb 6> [local count: 531502204]:
  # prephitmp_49 = PHI <_25(3)>
  _2 = (void *) ivtmp.8_30;
  MEM[base: _2, offset: 0B] = prephitmp_49;
  ivtmp.8_29 = ivtmp.8_30 + 8;
  if (ivtmp.8_29 != _6)
    goto <bb 3>; [98.99%]
  else
    goto <bb 7>; [1.01%]

I don't see any simplifications that happen as a result of that
transformation, and we're not really even going to be able to simplify the
branching structure.  So ISTM that split-paths is where we ought to
concentrate.  It's been problematic in this space before -- it's got a
number of hacks already to try and discourage path splitting in cases where
doing so is going to inhibit if-conversion.
[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Jakub Jelinek changed:
           What    |Removed |Added
                 CC|        |jakub at gcc dot gnu.org,
                   |        |law at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek ---
This changed (looking at the #c6 testcase) with r242550, and the change was
intentional; tree if-conversion, except when it enables successful
vectorization, is usually harmful and only rarely useful, but this is one
of those cases.  You can get back the previous behavior with -O3
-ftree-loop-if-convert.

The reason why the RTL CE stuff doesn't do anything here is that we end up
with multiple statements in the if block, so instead of:

  if (_20 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  _18 = _25 ^ 2567483615;

  <bb 5>:
  # _200 = PHI <_25(3), _18(4)>
  MEM[base: _3, offset: 0B] = _200;
  ivtmp.11_29 = ivtmp.11_30 + 8;
  if (_6 == ivtmp.11_29)
    goto <bb 6>;
  else
    goto <bb 3>;

we get:

  if (_20 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  _18 = _25 ^ 2567483615;
  MEM[base: _3, offset: 0B] = _18;
  ivtmp.11_42 = ivtmp.11_30 + 8;
  if (_6 == ivtmp.11_42)
    goto <bb 6>;
  else
    goto <bb 3>;

  <bb 5>:
  MEM[base: _3, offset: 0B] = _25;
  ivtmp.11_29 = ivtmp.11_30 + 8;
  if (_6 == ivtmp.11_29)
    goto <bb 6>;
  else
    goto <bb 3>;

This is created by dom; I wonder what the benefit is in this case.  Even if
we don't improve it in ifcvt.c, if we can make the bb 5 with just the xor
fall through into bb 6, i.e. the conditional branch just jumps over the
single insn, then that looks more beneficial than duplicating more stmts.
Jeff?

Though, it seems that multiple passes are keen on doing this kind of stuff,
so in order to avoid that I have to use:

  -O3 -fno-tree-dominator-opts -fno-tree-vrp -fno-split-paths

With that we get:

	andl	$1, %edx
	je	.L2
	xorq	%r8, %rax
.L2:
	movq	%rax, (%rdi)
	addq	$8, %rdi
	cmpq	%rdi, %rsi
	jne	.L3

which is IMHO better, but still not the cmov.
The conditional block contains in that case:

  (insn 20 19 21 5 (set (reg:DI 105)
          (const_int 2567483615 [0x9908b0df])) 85 {*movdi_internal}
       (nil))
  (insn 21 20 22 5 (parallel [
              (set (reg:DI 93 [ _25 ])
                  (xor:DI (reg:DI 93 [ _25 ])
                      (reg:DI 105)))
              (clobber (reg:CC 17 flags))
          ]) 443 {*xordi_1}
       (expr_list:REG_DEAD (reg:DI 105)
          (expr_list:REG_UNUSED (reg:CC 17 flags)
              (nil))))

which is too much for ifcvt.  If I change the #c6 testcase to:

  void foo(unsigned long *M)
  {
    for (unsigned long k = 0; k < 227; ++k) {
      unsigned long y = ((M[k] & 0x80000000) | (M[k + 1] & 0x7fffffff));
      M[k] = (M[k + 397] ^ (y >> 1) ^ ((y & 1) ? 567483615 : 0));
    }
  }

so that the immediate fits into an x86_64 signed 32-bit immediate, then we
have just:

  (insn 20 19 21 5 (parallel [
              (set (reg:DI 93 [ _25 ])
                  (xor:DI (reg:DI 93 [ _25 ])
                      (const_int 567483615 [0x21d31cdf])))
              (clobber (reg:CC 17 flags))
          ]) 443 {*xordi_1}
       (expr_list:REG_UNUSED (reg:CC 17 flags)
          (nil)))

in the conditional block, and ifcvt.c can deal with that and we get:

	movq	%rax, %rsi
	xorq	$567483615, %rsi
	andl	$1, %edx
	cmovne	%rsi, %rax

(of course disabling the jump threading and path splitting is still needed
for this).  So, if we can do something about those, perhaps we could extend
ifcvt so that it could deal with a set of a pseudo to a constant needed for
the following instruction too and take it into account in the costs.
[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Richard Biener changed:
           What    |Removed |Added
   Target Milestone|7.2     |7.3

--- Comment #7 from Richard Biener ---
GCC 7.2 is being released, adjusting target milestone.

--- Comment #8 from Richard Biener ---
GCC 7.2 is being released, adjusting target milestone.
[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

--- Comment #5 from krister.walfridsson at gmail dot com ---
I have extracted a smaller test case.  The loops are generated from

  typedef mersenne_twister_engine<
      uint_fast32_t,
      32, 624, 397, 31,
      0x9908b0dfUL, 11, 0xffffffffUL, 7,
      0x9d2c5680UL, 15, 0xefc60000UL, 18,
      1812433253UL> mt19937;

and the expansion of the template ends up with loops like

  void foo(unsigned long *M)
  {
    for (unsigned long k = 0; k < 227; ++k) {
      unsigned long y = ((M[k] & 0x80000000) | (M[k + 1] & 0x7fffffff));
      M[k] = (M[k + 397] ^ (y >> 1) ^ ((y & 1) ? 2567483615 : 0));
    }
  }

which generates the dump described in the bug report.

--- Comment #6 from krister.walfridsson at gmail dot com ---
I have extracted a smaller test case.  The loops are generated from

  typedef mersenne_twister_engine<
      uint_fast32_t,
      32, 624, 397, 31,
      0x9908b0dfUL, 11, 0xffffffffUL, 7,
      0x9d2c5680UL, 15, 0xefc60000UL, 18,
      1812433253UL> mt19937;

and the expansion of the template ends up with loops like

  void foo(unsigned long *M)
  {
    for (unsigned long k = 0; k < 227; ++k) {
      unsigned long y = ((M[k] & 0x80000000) | (M[k + 1] & 0x7fffffff));
      M[k] = (M[k + 397] ^ (y >> 1) ^ ((y & 1) ? 2567483615 : 0));
    }
  }

which generates the dump described in the bug report.
[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Richard Biener changed:
           What    |Removed |Added
         Depends on|        |80491

--- Comment #4 from Richard Biener ---
Ok, so then this looks related to (or a dup of) PR80491.  I've also seen a
similar report that involves us no longer doing a specific inlining
(similar in using ).

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80491
[Bug 80491] [6/7/8 Regression] Compiler regression for long-add case.
[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

--- Comment #3 from krister.walfridsson at gmail dot com ---
You can see the issue in the generated code with

  int foo(std::mt19937 &gen)
  {
    std::uniform_int_distribution dist(0, 99);
    return dist(gen);
  }

too.  I.e. it is not just an artifact of the uninteresting use in the
benchmarking loop.
[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Richard Biener changed:
           What    |Removed |Added
           Priority|P3      |P2
   Target Milestone|---     |7.2

--- Comment #2 from Richard Biener ---
Looks like (from just the quoted dumps) path splitting again removes the
RTL if-conversion opportunity here.  Of course that's just because the
testcase wraps the computation in a loop that does nothing interesting.
[Bug tree-optimization/80520] [7/8 Regression] Performance regression from missing if-conversion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Marc Glisse changed:
           What    |Removed |Added
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2017-04-26
            Summary|Performance regression from |[7/8 Regression] Performance
                   |missing if-conversion       |regression from missing
                   |                            |if-conversion
     Ever confirmed|0           |1

--- Comment #1 from Marc Glisse ---
clang generates much slower code (and even worse with libc++).  With a
combination of -march=native and -fprofile-{generate,use}, I can get faster
code from gcc-7 than from gcc-6, while the .optimized dump still has
branches and not cond_expr.