https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
amker at gcc dot gnu.org changed:
What|Removed |Added
CC||amker at gcc dot gnu.org
--- Comment #52 from amker at gcc dot gnu.org ---
I don't understand powerpc assembly well, but this looks like the same problem
we see on aarch64/arm. Ah, and we are even looking at the same function...
I think this is a general issue caused by an inconsistency between the tree
level ivopt and the rtl level loop unroller. To be specific, the problem is in
how we handle induction variable registers after unrolling.
The core loop on aarch64 with options -O3 -funroll-all-loops -mcpu=cortex-a57
gives the output below:
.L3:
add x2, x0, 16
ldr q16, [x17, x0]
add x10, x0, 32
add x9, x0, 48
add x8, x0, 64
ldr q17, [x17, x2]
add x3, x0, 80
add x6, x0, 96
add x5, x0, 112
add w1, w1, 8
ldr q19, [x17, x10]
cmp w1, w14
ldr q18, [x17, x9]
ldr q20, [x17, x8]
ldr q21, [x17, x3]
ldr q22, [x17, x6]
ldr q23, [x17, x5]
str q16, [x18, x0]
add x0, x0, 128
str q17, [x18, x2]
str q19, [x18, x10]
str q18, [x18, x9]
str q20, [x18, x8]
str q21, [x18, x3]
str q22, [x18, x6]
str q23, [x18, x5]
bcc .L3
The tree ivopt dump is quite neat:
<bb 6>:
# ivtmp.16_28 = PHI <ivtmp.16_25(9), 0(5)>
# ivtmp.19_42 = PHI <ivtmp.19_41(9), 0(5)>
vect__4.13_62 = MEM[base: vectp_a.12_58, index: ivtmp.19_42, offset: 0B];
MEM[base: vectp_c.15_63, index: ivtmp.19_42, offset: 0B] = vect__4.13_62;
ivtmp.16_25 = ivtmp.16_28 + 1;
ivtmp.19_41 = ivtmp.19_42 + 16;
if (ivtmp.16_25 < bnd.7_36)
goto <bb 9>;
else
goto <bb 7>;
...
<bb 9>:
goto <bb 6>;
But after the rtl unroller, we have options like -fsplit-ivs-in-unroller and
-fweb. These two options try to split the long live range of induction
variables into separate ones. Eventually, with the following fwprop and IRA, we
have multiple ivs for each original iv.
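To make the effect concrete, here is a rough C-level analogy of the split, as a sketch only. The function name copy_split_ivs and the 4x unroll factor are made up for illustration, and a C compiler is of course free to recombine these temporaries; the point is just that each unrolled copy gets its own index, like x2, x10, x9, ... in the assembly above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration: copy loop unrolled 4x, 16 bytes per copy,
   with the index split into one temporary per unrolled copy.  */
void copy_split_ivs(uint8_t *dst, const uint8_t *src, size_t nbytes)
{
    /* Sketch assumption: no remainder loop, so require a multiple of 64.  */
    assert(nbytes % 64 == 0);

    for (size_t i0 = 0; i0 < nbytes; i0 += 64) {
        /* One extra index per unrolled copy; each wants its own register.  */
        size_t i1 = i0 + 16;
        size_t i2 = i0 + 32;
        size_t i3 = i0 + 48;

        memcpy(dst + i0, src + i0, 16);   /* like [x17, x0]  */
        memcpy(dst + i1, src + i1, 16);   /* like [x17, x2]  */
        memcpy(dst + i2, src + i2, 16);   /* like [x17, x10] */
        memcpy(dst + i3, src + i3, 16);   /* like [x17, x9]  */
    }
}
```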
I see two possible fixes here. One is to implement a tree level unroller
before IVOPT and remove the rtl one. The rtl one is aggressive enough that we
didn't enable it by default at -O3.
Another is to change how we handle the unrolled iv in the rtl unroller. It
splits the unrolled iv to avoid a pseudo register with a long live range, since
that may affect rtl optimizers. This assumption may have held before, but it
does not seem true to me nowadays, especially for induction variables. On tree
level ivopts we already assume that each iv occupies a register; ivs are also
intensively used and thus should live in one single hard register. For this
specific case, we can refactor [base+index] out of the memory reference and use
[new_base], [new_base+4], [new_base+8], ... etc. in unrolling. If tree ivopts
chooses the [reg+offset] addressing mode, we only need to generate an
instruction sequence like [reg+offset], [reg+(offset+4)], [reg+(offset+8)]...
followed by reg = reg + unrolled_times*step.
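A minimal C-level sketch of the shape I have in mind, under stated assumptions: the name copy_single_iv, the 4x unroll factor and the 16-byte step are all hypothetical, and this only mirrors the idea at source level, it is not what the rtl unroller does today. There is one index, base+index is formed once per iteration, and every unrolled copy uses a constant offset from that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration: copy loop unrolled 4x with a single iv.
   new_base = base + index is computed once per iteration, then each
   unrolled copy addresses [new_base + constant].  */
void copy_single_iv(uint8_t *dst, const uint8_t *src, size_t nbytes)
{
    /* Sketch assumption: no remainder loop, so require a multiple of 64.  */
    assert(nbytes % 64 == 0);

    for (size_t i = 0; i < nbytes; i += 4 * 16) {  /* reg += unrolled_times*step */
        const uint8_t *sb = src + i;               /* new_base for loads  */
        uint8_t *db = dst + i;                     /* new_base for stores */

        memcpy(db + 0,  sb + 0,  16);              /* [new_base]      */
        memcpy(db + 16, sb + 16, 16);              /* [new_base + 16] */
        memcpy(db + 32, sb + 32, 16);              /* [new_base + 32] */
        memcpy(db + 48, sb + 48, 16);              /* [new_base + 48] */
    }
}
```

Only i stays live across iterations here, so the per-copy index registers from the split form go away.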
Thanks,
bin