[Bug rtl-optimization/70164] [6 Regression] Code/performance regression due to poor register allocation on Cortex-M0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

--- Comment #14 from Jeffrey A. Law ---
Some further notes. I was looking at what the impact would be if we just stopped recording the problematic equivalences in CSE, both to see if the equivalences are useful at all and, if they are, to get a sense of when (which might lead to some useful conditions for recording them).

I was quite surprised at how much of a difference these equivalences make in the resulting code generation.

One of the things that is emerging is that these equivalences are useful when the copy propagation they enable allows one operand to die at the comparison when it didn't die before. That in turn may allow creation of a REG_EQUIV note on the insn that initializes the dying register, which then allows substitution of the equivalent memory for the register in the comparison. We still have the same number of memory references, but we use one less register in that case and have one less instruction.

We obviously don't have that level of insight during CSE. But given uses/sets DF info, we can get a lot of the way there.

Anyway, just wanted to record some of my findings. I'm putting this down for now as it's not likely to be a gcc-6 kind of change.
[Bug rtl-optimization/70164] [6 Regression] Code/performance regression due to poor register allocation on Cortex-M0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

Jeffrey A. Law changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1                          |P2

--- Comment #13 from Jeffrey A. Law ---
Essentially this is the same problem we have with DOM using context-sensitive equivalences to copy propagate into subgraphs, but in CSE.

I'm increasingly of the opinion that such equivalences, when DOM finds them, should be used for simplification only, not for copy propagation. That opinion would apply to CSE as well. I'm not sure if we can put the pieces in place for gcc-6, but I think that's the direction we ought to be going.

The alternative would be to do some kind of range splitting. What we'd want to know is whether we have a context-sensitive equivalence and whether splitting the range in the dominated subgraph would result in a graph that is more easily/better colorable. In this case, the subgraph creates all the conflicts, so it's an obvious split point, but I'm not sure how easily we could generalize that.

Either way I don't think this should be a release-blocking issue. Moving to P2, but keeping the target gcc-6.
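To make the "simplification only, not copy propagation" distinction concrete, here is a minimal Python sketch. It is not GCC code: the tuple encoding of insns, the function name apply_equivalence, and the "eq" insn are all hypothetical, chosen only to show how the two policies differ inside a block where r116 == r117 is known to hold.

```python
def apply_equivalence(insns, equiv, copy_propagate):
    """Apply a block-local equivalence to toy insns of the form (op, dest, *srcs).

    equiv is (keep, replace): inside this block `replace` is known to equal `keep`.
    With copy_propagate=False the equivalence is used only to fold comparisons.
    With copy_propagate=True every use of `replace` is also rewritten to `keep`.
    """
    keep, replace = equiv
    out = []
    for insn in insns:
        op, dest, *srcs = insn
        if op == "eq" and set(srcs) == {keep, replace}:
            # Simplification: the comparison is known to be true here.
            out.append(("const", dest, 1))
        elif copy_propagate:
            # Copy propagation: substitute the equivalent register into uses.
            out.append((op, dest, *[keep if s == replace else s for s in srcs]))
        else:
            out.append(insn)
    return out


# Hypothetical block dominated by the r116 == r117 test.
block = [("move", "r0", "r117"), ("eq", "t", "r116", "r117")]
simplify_only = apply_equivalence(block, ("r116", "r117"), copy_propagate=False)
with_copy_prop = apply_equivalence(block, ("r116", "r117"), copy_propagate=True)
```

Under the simplification-only policy the move keeps reading r117, so r116 does not become live in the block; with copy propagation the move reads r116 instead, which is the kind of rewrite this bug is about.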
[Bug rtl-optimization/70164] [6 Regression] Code/performance regression due to poor register allocation on Cortex-M0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

--- Comment #11 from Jeffrey A. Law ---
So given the conflicts during IRA I can't see a way for IRA to do a better job. Essentially the key allocno/pseudo wants hard reg 0 to avoid the spillage, but it also conflicts with hard reg 0.

Prior to CSE1 we have the following key statements:

(insn 15 14 16 2 (set (reg/v/f:SI 116 [ ]) (reg:SI 0 r0)) j.c:23 748 {*thumb1_movsi_insn} (nil))

[ ... ]

(jump_insn 19 18 20 2 (set (pc) (if_then_else (ne (reg/v/f:SI 117 [ line ]) (reg/v/f:SI 116 [ ])) (label_ref:SI 36) (pc))) j.c:25 756 {cbranchsi4_insn} (int_list:REG_BR_PROB 8987 (nil)) -> 36)

[ ... ]

(insn 21 20 22 3 (set (reg:SI 0 r0) (reg/v/f:SI 117 [ line ])) j.c:25 748 {*thumb1_movsi_insn} (nil))

[ ... ]

(insn 31 30 36 3 (set (reg/v/f:SI 116 [ ]) (reg:SI 0 r0)) j.c:25 748 {*thumb1_movsi_insn} (nil))

[ ... ]

(insn 28 27 29 3 (set (reg:SI 1 r1) (reg/v/f:SI 117 [ line ])) j.c:25 748 {*thumb1_movsi_insn} (nil))

[ ... ]

(insn 37 39 38 4 (set (reg/i:SI 0 r0) (reg/v/f:SI 116 [ ])) j.c:26 748 {*thumb1_movsi_insn} (nil))
(insn 38 37 0 4 (use (reg/i:SI 0 r0)) j.c:26 -1 (nil))

Of particular interest is that r116 is not live-in to bb3. If you do the full analysis, it can be shown that r116 does not conflict with r0 before cse1. And that's key, because to get the code we want, r116 needs to be assigned to r0.

cse (like DOM) has the ability to look at an equality conditional and propagate equivalences into the true or false arm. And if we look at insn 19, we've got an equality conditional between r117 and r116, which will set up an equivalence between r116 and r117 for bb3.

So in bb3, cse1 will change the initial assignment from:

(insn 21 20 22 3 (set (reg:SI 0 r0) (reg/v/f:SI 117 [ line ])) j.c:25 748 {*thumb1_movsi_insn} (nil))

to:

(insn 21 20 22 3 (set (reg:SI 0 r0) (reg/v/f:SI 116 [ ])) j.c:25 748 {*thumb1_movsi_insn} (nil))

Which makes r116 live-in for bb3. But note that it doesn't change insn 28 (yet).

forwprop then comes along and changes insn 28 to look like:

(insn 28 27 29 3 (set (reg:SI 1 r1) (reg/v/f:SI 116 [ ])) j.c:25 748 {*thumb1_movsi_insn} (expr_list:REG_DEAD (reg/v/f:SI 116 [ ]) (nil)))

Now r116 is both live-in to bb3 and conflicts with r0 within bb3. At which point we have lost.

I've got a couple of things I want to poke at... But nothing that I think has a high probability of success.
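The liveness argument above can be checked with a small Python sketch. This is a toy model, not IRA: it covers only the three bb3 moves quoted above (insns 21, 28, 31, ordered by their prev/next links), ignores the elided insns, assumes r116 is the only register live out of bb3 (it is used by insn 37), and uses the simplest interference rule, where a definition conflicts with everything live after it. The function name analyze is hypothetical.

```python
def analyze(block, live_out):
    """Backward scan over (dest, src) moves.

    Returns (live_in, conflicts), where conflicts is a set of frozenset pairs.
    Assumed rule: a defined register conflicts with every other register
    that is live immediately after the definition.
    """
    live = set(live_out)
    conflicts = set()
    for dest, src in reversed(block):
        for other in live - {dest}:
            conflicts.add(frozenset((dest, other)))
        live.discard(dest)  # the definition ends the live range going backward
        live.add(src)       # the use makes the source live before this insn
    return live, conflicts


# bb3 before cse1: insn 21 (r0 = r117), insn 28 (r1 = r117), insn 31 (r116 = r0)
before = [("r0", "r117"), ("r1", "r117"), ("r116", "r0")]
# bb3 after r117 has been replaced by r116 in insns 21 and 28
after = [("r0", "r116"), ("r1", "r116"), ("r116", "r0")]

live_in_before, conflicts_before = analyze(before, {"r116"})
live_in_after, conflicts_after = analyze(after, {"r116"})
```

In this model r116 is absent from live_in_before and has no conflict with r0, while after the substitution it is live-in and the pair (r0, r116) appears in conflicts_after, which matches the situation described above.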
[Bug rtl-optimization/70164] [6 Regression] Code/performance regression due to poor register allocation on Cortex-M0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

--- Comment #12 from Jeffrey A. Law ---
Slight correction. I was looking at the wrong part of the dump when I said cse1 didn't change insn 28. It is cse1 that changes insn 28.

So this is strictly an issue with the transformations cse1 makes.
[Bug rtl-optimization/70164] [6 Regression] Code/performance regression due to poor register allocation on Cortex-M0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

--- Comment #10 from Vladimir Makarov ---
(In reply to Jeffrey A. Law from comment #9)
> I think that's a fair characterization. The extra copy emitted by the older
> compiler gives the allocator more freedom. With coalescing getting more
> aggressive, the copy is gone and the allocator's freedom is reduced.
>
> I'll try to have a look at what the allocator is doing, but I doubt it's
> realistically something that can be addressed in this release cycle.

I agree. It will probably be hard to fix in IRA at this stage.

Coalescing is a controversial thing, which is why there are so many coalescing algorithms. I tried a lot of them when I worked on global RA. In the end, I found that implicit coalescing worked best. The word `implicit` means that we propagate hard register preferences (through copies, including implicit ones for two-operand constraints) from already assigned pseudos to unassigned ones. When it is possible to assign the same hard register, we do it and remove the copies. Otherwise, we can still assign a hard register, which might be impossible after explicitly coalescing two pseudos.

Only LRA does explicit coalescing, for pseudos assigned to memory, as we have no constraints on the number of stack slots, and memory-memory moves are expensive and require an additional hard reg.

I guess this sort of PR could probably be fixed if we had live-range splitting at any place, not only on loop borders. But it might create other PRs if it makes wrong decisions :)

Unfortunately, it is all about heuristics. They can work successfully in one case and do bad things in another. The performance of credible benchmarks should be the criterion.
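As a rough illustration of the implicit coalescing idea described above, here is a minimal Python sketch. It is not IRA's implementation: the function name assign_with_preferences, the data layout, and the fixed allocation order are assumptions, and conflicts between pseudos are not modeled (only hard registers a pseudo may not use).

```python
def assign_with_preferences(pseudos, copies, forbidden, hard_regs):
    """Assign hard registers, preferring the register of an already assigned copy partner.

    copies    : list of (a, b) register pairs connected by a copy
    forbidden : maps a pseudo to the set of hard registers it conflicts with
    Returns (assignment, removable_copies); a pseudo maps to None if nothing fits.
    """
    assignment = {h: h for h in hard_regs}  # hard registers are their own assignment
    for p in pseudos:
        preferred = []
        for a, b in copies:
            partner = b if a == p else a if b == p else None
            reg = assignment.get(partner)
            if reg is not None and reg not in preferred:
                preferred.append(reg)
        # Try preferred registers first, then the remaining ones in order.
        candidates = preferred + [h for h in hard_regs if h not in preferred]
        allowed = [h for h in candidates if h not in forbidden.get(p, set())]
        assignment[p] = allowed[0] if allowed else None
    # A copy becomes a no-op when both ends ended up in the same hard register.
    removable = [(a, b) for a, b in copies
                 if assignment.get(a) is not None and assignment.get(a) == assignment.get(b)]
    return {p: assignment[p] for p in pseudos}, removable


# p116 is copied to/from r0. Without a conflict it gets r0 and the copy disappears.
free_case = assign_with_preferences(["p116"], [("p116", "r0")], {}, ["r0", "r1"])
# With a conflict on r0 the preference cannot be honored, but p116 still gets a register.
conflict_case = assign_with_preferences(["p116"], [("p116", "r0")], {"p116": {"r0"}}, ["r0", "r1"])
```

The second case is the point of keeping the coalescing implicit: the copy stays, but the pseudo is still allocated to some hard register rather than being tied to one it cannot have.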
[Bug rtl-optimization/70164] [6 Regression] Code/performance regression due to poor register allocation on Cortex-M0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

--- Comment #9 from Jeffrey A. Law ---
I think that's a fair characterization. The extra copy emitted by the older compiler gives the allocator more freedom. With coalescing getting more aggressive, the copy is gone and the allocator's freedom is reduced.

I'll try to have a look at what the allocator is doing, but I doubt it's realistically something that can be addressed in this release cycle.
[Bug rtl-optimization/70164] [6 Regression] Code/performance regression due to poor register allocation on Cortex-M0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization, ra
           Priority|P3                          |P1
                 CC|                            |vmakarov at gcc dot gnu.org

--- Comment #8 from Richard Biener ---
So it's an RA issue then, which previously was mitigated by the extra copy (and thus a split live range).
[Bug rtl-optimization/70164] [6 Regression] Code/performance regression due to poor register allocation on Cortex-M0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

Jeffrey A. Law changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2016-03-10
                 CC|                            |law at redhat dot com
     Ever confirmed|0                           |1

--- Comment #7 from Jeffrey A. Law ---
AFAICT the coalescing code is working as expected here. Working with r226900...

So, the only real statement of interest is:

  # iftmp.0_1 = PHI

Which results in the following partition map:

Partition 0 (iftmp.0_1 - 1 )
Partition 1 (line_7(D) - 7 )
Partition 2 (iftmp.0_18 - 18 )

And the following coalesce list:

Coalesce list: (1)iftmp.0_1 & (18)iftmp.0_18 [map: 0, 2] : Success -> 0

Note that new_line_9 isn't ever processed. Which is a bit odd to say the least.

Moving to r226901 we have:

Partition 0 (iftmp.0_1 - 1 )
Partition 1 (line_7(D) - 7 )
Partition 2 (new_line_9 - 9 )
Partition 3 (iftmp.0_18 - 18 )

Coalesce list: (1)iftmp.0_1 & (9)new_line_9 [map: 0, 2] : Success -> 0
Coalesce list: (1)iftmp.0_1 & (18)iftmp.0_18 [map: 0, 3] : Success -> 0

Note that we coalesced new_line_9 into the same partition as iftmp.0_{1,18}. That seems valid given their use in the PHI and my quick review of the conflicts.

So looking at the actual expansion, r226900 will emit a copy from new_line_9 into iftmp.0_{1,18}. That's a result of r226900 not coalescing the objects.

So AFAICT this isn't a coalescing issue, at least not at the gimple->rtl expansion point.
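The difference between the two partition maps can be reproduced with a small union-find sketch in Python. This is only a model of the dumps quoted above, not GCC's coalescing code: the function name coalesce is hypothetical, and conflict checks are omitted because every attempt in the quoted coalesce lists succeeded.

```python
def coalesce(names, coalesce_list):
    """Merge SSA names into partitions; returns a map from name to partition leader."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the leader, halving the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in coalesce_list:
        # Names that are not in the partition map are never processed.
        if a in parent and b in parent:
            parent[find(b)] = find(a)
    return {n: find(n) for n in names}


# r226900: new_line_9 is not in the partition map at all.
old = coalesce(["iftmp.0_1", "line_7(D)", "iftmp.0_18"],
               [("iftmp.0_1", "iftmp.0_18")])

# r226901: new_line_9 is in the map and is coalesced with the PHI result.
new = coalesce(["iftmp.0_1", "line_7(D)", "new_line_9", "iftmp.0_18"],
               [("iftmp.0_1", "new_line_9"), ("iftmp.0_1", "iftmp.0_18")])
```

In the first result new_line_9 has no partition, so a copy into the iftmp.0_{1,18} partition is needed at expansion; in the second all three names share one partition and no copy is emitted.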
[Bug rtl-optimization/70164] [6 Regression] Code/performance regression due to poor register allocation on Cortex-M0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70164

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
   Target Milestone|---                         |6.0
            Summary|Code/performance regression |[6 Regression]
                   |due to poor register        |Code/performance regression
                   |allocation on Cortex-M0     |due to poor register
                   |                            |allocation on Cortex-M0

--- Comment #6 from Richard Biener ---
So probably another coalescing "fallout"