On Fri, Aug 28, 2015 at 5:48 PM, Jeff Law <[email protected]> wrote:
> On 08/28/2015 09:43 AM, Simon Dardis wrote:
>
>> Following Jeff's advice[1] to extract more information from GCC, I've
>> narrowed the cause down to the predictive commoning pass inserting
>> the load in a loop-header-style basic block. However, the next pass
>> in GCC, tree-cunroll, promptly removes the loop and joins the loop
>> header to the body of the (non)loop. More oddly, disabling the
>> conditional store elimination pass, the dominator optimizations
>> pass, or jump-threading (with --param
>> max-jump-thread-duplication-stmts=0) nets the above assembly code. Any
>> ideas on an approach for this issue?
>
> I'd probably start by looking at the .optimized tree dump in both cases to
> understand the difference, then (most likely) tracing that through the RTL
> optimizers into the register allocator.

It's the known issue of LIM (here the instance that runs after pcom and
complete unrolling of the inner loop) being too aggressive with store
motion. Here the complete array is replaced with registers for the
outer loop. Were 'poly' a local variable, we'd have optimized it away
completely.

  <bb 6>:
  _8 = 1.0e+0 / pretmp_42;
  _12 = _8 * _8;
  poly[1] = _12;

  <bb 7>:
  # prephitmp_30 = PHI <_12(6), _36(9)>
  # T_lsm.8_22 = PHI <_8(6), pretmp_42(9)>
  poly_I_lsm0.10_38 = MEM[(double *)&poly + 8B];
  _2 = prephitmp_30 * poly_I_lsm0.10_38;
  _54 = _2 * poly_I_lsm0.10_38;
  _67 = poly_I_lsm0.10_38 * _54;
  _80 = poly_I_lsm0.10_38 * _67;
  _93 = poly_I_lsm0.10_38 * _80;
  _106 = poly_I_lsm0.10_38 * _93;
  _19 = poly_I_lsm0.10_38 * _106;
  count_23 = count_28 + 1;
  if (count_23 != iterations_6(D))
    goto <bb 5>;
  else
    goto <bb 8>;

  <bb 8>:
  poly[2] = _2;
  poly[3] = _54;
  poly[4] = _67;
  poly[5] = _80;
  poly[6] = _93;
  poly[7] = _106;
  poly[8] = _19;
  i1 = 9;
  T = T_lsm.8_22;

Note that DOM fails to CSE poly[1] (a known defect), but, heh, doing that
would only increase register pressure even more.
Note the above is on x86_64.
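
For illustration only, here is a minimal C sketch of what store motion
amounts to at the source level. It is not the original test case: the
array size (five elements instead of nine), the function names and the
loop body are all assumptions made up for the example. The first function
writes the global array on every iteration; the second keeps each element
in a scalar for the whole loop and writes it back once after the exit,
which is roughly the shape of the dump above. Every element then needs its
own register across the loop, hence the pressure.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the global array; the real one has more
   elements.  */
double poly[5];

/* Before store motion: every iteration stores to the global array.  */
void fill_direct(double t, int iterations)
{
    for (int count = 0; count < iterations; count++) {
        double x = 1.0 / t;
        poly[1] = x * x;
        poly[2] = poly[1] * x;
        poly[3] = poly[2] * x;
        poly[4] = poly[3] * x;
    }
}

/* After store motion (hand-written approximation): each array element
   lives in its own scalar across the loop and is stored once after the
   loop exits.  All four scalars are live at the same time.  */
void fill_store_motion(double t, int iterations)
{
    double p1 = poly[1], p2 = poly[2], p3 = poly[3], p4 = poly[4];

    for (int count = 0; count < iterations; count++) {
        double x = 1.0 / t;
        p1 = x * x;
        p2 = p1 * x;
        p3 = p2 * x;
        p4 = p3 * x;
    }

    poly[1] = p1;
    poly[2] = p2;
    poly[3] = p3;
    poly[4] = p4;
}

/* Returns 1 if both variants leave the same bytes in poly.  */
int store_motion_matches(double t, int iterations)
{
    double expected[5];

    memset(poly, 0, sizeof poly);
    fill_direct(t, iterations);
    memcpy(expected, poly, sizeof poly);

    memset(poly, 0, sizeof poly);
    fill_store_motion(t, iterations);
    return memcmp(expected, poly, sizeof poly) == 0;
}

/* Small self-check, using a value of t whose powers are exact.  */
void self_check(void)
{
    assert(store_motion_matches(2.0, 0));
    assert(store_motion_matches(2.0, 3));
}
```

Both variants compute the same values; the only difference is how long
each one has to stay in a register.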
Richard.
> jeff