--- Comment #21 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org>
Created attachment 43646
Patch to reduce spills for Armv7
(In reply to rsand...@gcc.gnu.org from comment #20)
> (In reply to Wilco from comment #12)
> > There are 2 separate issues in the ARMv7 case. One is scheduling, the -S
> > output goes down from 437 lines to 305 lines with -fno-schedule-insns (stack
> > size 276 rather than 448 bytes). So basically the "register pressure aware"
> > scheduler introduces lots of unnecessary spills.
> This is kind-of expected in general, though almost certainly wrong in this
> case. The default "weighted" algorithm tended to overemphasise decreasing
> spills (at the cost of decreasing ILP) and slowed down some important
> benchmarks for which some spilling was better. The "model" algorithm was
> supposed to be a compromise.
> I'll have a look to see whether there's an easy way of handling this case
> better without regressing others. (I'm not assigning myself since it's
> unrelated to the x86 problem.)
SCHED_PRESSURE_MODEL first tries to create a "model" schedule
that keeps register down as far as possible and then uses that
to guide the "real" schedule. It looks like the model schedule
goes catastrophically wrong in this case though: the original
order had a VFP_REGS pressure of 56 (against a capacity of 64)
while the model schedule had a pressure of 76(!).
I think the problem is that the algorithm was tuned on load/store
style loops, where it was beneficial to keep the model schedule
narrow and try to reach the eventual store (so killing off a
whole chain). It doesn't cope well with so many accumulators,
where completing the chain never leads to a reduction in pressure.
The attached patch is a proof of concept that tries to handle
this kind of situation better. The model schedule now gives
a VFP_REGS pressure of 52 instead of 76, which is 4 below the
unscheduled code. I'll try to give it more wider testing when
I have time.
Although the patch removes some of the spills, the scheduler
still thinks that it's better to keep others. And in that
sense it's working as intended, since as far as GCC's view
of the pipeline is concerned, the spills give faster code.
This can be seen by grepping for "total time" in the sched2
dumps, which include the effect of all the spill code.
The times for the inner loop in this test are:
307 cycles for the unpatched compiler (most spills)
355 cycles for the patched compiler (some spills)
398 cycles with -fno-schedule-insns (no spills)
These were all with "-mcpu=cortex-a15 -O2" but the
results are similar with other -mcpu options.
So on GCC's own terms, using its model of the CPU,
the current mega-spill code seems like a 25% win over
the spill-free code. That's probably not true in practice,
but the scheduler can only work within the description