https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
--- Comment #39 from Aldy Hernandez <aldyh at gcc dot gnu.org> --- (In reply to [email protected] from comment #38) > On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359 > > > > --- Comment #37 from Aldy Hernandez <aldyh at gcc dot gnu.org> --- > > Hi Richi. > > > > (In reply to [email protected] from comment #31) > > > > > I'd have not restricted the out-of-loop IV use to IV +- CST but > > > instead did the transform > > > > > > + LOOP: > > > + # p_8 = PHI <p_16(2), p_INC(3)> > > > + ... > > > + p_INC = p_8 - 1; > > > + goto LOOP; > > > + ... p_8 uses ... > > > > > > to > > > > > > + LOOP: > > > + # p_8 = PHI <p_16(2), p_INC(3)> > > > + ... > > > + p_INC = p_8 - 1; > > > + goto LOOP; > > > newtem_12 = p_INC + 1; // undo IV increment > > > ... p_8 out-of-loop p_8 uses replaced with newtem_12 ... > > > > > > so it would always work if we can undo the IV increment. > > > > > > The disadvantage might be that we then rely on RTL optimizations > > > to combine the original out-of-loop constant add with the > > > newtem computation but I guess that's not too much to ask ;) > > > k > > > > It looks like RTL optimizations have a harder time optimizing things when I > > take the above approach. > > > > Doing what you suggest, we end up optimizing this (simplified for brevity): > > > > <bb 3> > > # p_8 = PHI <p_16(2), p_19(3)> > > p_19 = p_8 + 4294967295; > > if (ui_7 > 9) > > goto <bb 3>; [89.00%] > > ... > > > > <bb 5> > > p_22 = p_8 + 4294967294; > > MEM[(char *)p_19 + 4294967295B] = 45; > > > > into this: > > > > <bb 3>: > > # p_8 = PHI <p_16(2), p_19(3)> > > p_19 = p_8 + 4294967295; > > if (ui_7 > 9) > > ... > > > > <bb 4>: > > _25 = p_19 + 1; ;; undo the increment > > ... > > > > <bb 5>: > > p_22 = _25 + 4294967294; > > MEM[(char *)_25 + 4294967294B] = 45; > > shouldn't the p_19 in the MEM remain? Why a new BB for the adjustment? > (just asking...) Oh, I was adjusting/propagating that one as a special case to see if the [_25 + 4294967294] could be CSE'd somehow by the RTL optimizers. But even with the p_19 remaining, the result is the same in the assembly. For the record, with the p_19 in place, we would get (see below for a more complete dump): p_22 = _25 + 4294967294; MEM[(char *)p_19 + 4294967295B] = 45; There is no new BB. That was the result of adding the TMP on the non-back edge. My simplification of the gimple IL for illustration purposes made that bit unclear. The unadulterated original was: <bb 3> [local count: 1073741825]: # ui_7 = PHI <ui_13(2), ui_21(3)> # p_8 = PHI <p_16(2), p_19(3)> _4 = ui_7 % 10; _5 = (char) _4; p_19 = p_8 + 4294967295; _6 = _5 + 48; MEM[base: p_19, offset: 0B] = _6; ui_21 = ui_7 / 10; if (ui_7 > 9) goto <bb 3>; [89.00%] else goto <bb 4>; [11.00%] <bb 4> [local count: 118111601]: if (i_12(D) < 0) goto <bb 5>; [41.00%] else goto <bb 6>; [59.00%] <bb 5> [local count: 48425756]: p_22 = p_8 + 4294967294; MEM[(char *)p_19 + 4294967295B] = 45; with a corresponding transformation (keeping the p_19 as you inquired) of: <bb 3> [local count: 1073741825]: # ui_7 = PHI <ui_13(2), ui_24(3)> # p_8 = PHI <p_16(2), p_19(3)> _4 = ui_7 % 10; _5 = (char) _4; p_19 = p_8 + 4294967295; _6 = _5 + 48; MEM[base: p_19, offset: 0B] = _6; ui_21 = ui_7 / 10; ui_24 = ui_21; if (ui_7 > 9) goto <bb 3>; [89.00%] else goto <bb 4>; [11.00%] <bb 4> [local count: 118111601]: _25 = p_19 + 1; ;; undo the increment on the non-back edge if (i_12(D) < 0) goto <bb 5>; [41.00%] else goto <bb 6>; [59.00%] <bb 5> [local count: 48425756]: p_22 = _25 + 4294967294; MEM[(char *)p_19 + 4294967295B] = 45; > > > I haven't dug into the RTL optimizations, but the end result is that we only > > get one auto-dec inside the loop, and some -2 indexing outside of it: > > > > strb r1, [r4, #-1]! > > lsr r3, r3, #3 > > bhi .L4 > > cmp r6, #0 > > movlt r2, #45 > > add r3, r4, #1 > > strblt r2, [r3, #-2] > > sublt r4, r4, #1 > > > > as opposed to mine: > > > > <bb 3>: > > # p_8 = PHI <p_16(2), p_19(3)> > > p_19 = p_8 + 4294967295; > > if (ui_7 > 9) > > ... > > > > <bb 5>: > > p_22 = p_19 + 4294967295; > > *p_22 = 45; > > > > which gives us two auto-dec, and much tighter code: > > > > strb r1, [r4, #-1]! > > lsr r3, r3, #3 > > bhi .L4 > > cmp r6, #0 > > movlt r3, #45 > > strblt r3, [r4, #-1]! > > > > Would it be OK to go with my approach, or is worth looking into the rtl > > optimizers and seeing what can be done (boo! :)). > > Would be interesting to see what is causing things to break. If it's > only combine doing the trick then it will obviously be multi-uses > of the adjustment pseudo. But eventually I thought fwprop would > come to the rescue... I could post the WIP for this approach if anyone is interested. > > I think a good approach would be to instead of just replacing > all out-of-loop uses with the new SSA name doing the adjustment > check if the use is in a "simple" (thus SSA name +- cst) stmt > and there forward the adjustment. > > So have the best of both worlds. I would have hoped this > extra complication isn't needed but as you figured... Hmmm, can we go with my first patch (cleaned up) instead, as it's already working? Thanks.
