--- Comment #22 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org>
(In reply to rsand...@gcc.gnu.org from comment #21)
> Created attachment 43646
> Patch to reduce spills for Armv7
> (In reply to rsand...@gcc.gnu.org from comment #20)
> > (In reply to Wilco from comment #12)
> > > There are 2 separate issues in the ARMv7 case. One is scheduling, the -S
> > > output goes down from 437 lines to 305 lines with -fno-schedule-insns
> > > (stack size 276 rather than 448 bytes). So basically the "register pressure
> > > aware" scheduler introduces lots of unnecessary spills.
> > This is kind-of expected in general, though almost certainly wrong in this
> > case. The default "weighted" algorithm tended to overemphasise decreasing
> > spills (at the cost of decreasing ILP) and slowed down some important
> > benchmarks for which some spilling was better. The "model" algorithm was
> > supposed to be a compromise.
> > I'll have a look to see whether there's an easy way of handling this case
> > better without regressing others. (I'm not assigning myself since it's
> > unrelated to the x86 problem.)
> SCHED_PRESSURE_MODEL first tries to create a "model" schedule
> that keeps register pressure down as far as possible and then uses that
> to guide the "real" schedule. It looks like the model schedule
> goes catastrophically wrong in this case though: the original
> order had a VFP_REGS pressure of 56 (against a capacity of 64)
> while the model schedule had a pressure of 76(!).
> I think the problem is that the algorithm was tuned on load/store
> style loops, where it was beneficial to keep the model schedule
> narrow and try to reach the eventual store (so killing off a
> whole chain). It doesn't cope well with so many accumulators,
> where completing the chain never leads to a reduction in pressure.
> The attached patch is a proof of concept that tries to handle
> this kind of situation better. The model schedule now gives
> a VFP_REGS pressure of 52 instead of 76, which is 4 below the
> unscheduled code. I'll try to give it wider testing when
> I have time.
> Although the patch removes some of the spills, the scheduler
> still thinks that it's better to keep others. And in that
> sense it's working as intended, since as far as GCC's view
> of the pipeline is concerned, the spills give faster code.
> This can be seen by grepping for "total time" in the sched2
> dumps, which include the effect of all the spill code.
> The times for the inner loop in this test are:
> 307 cycles for the unpatched compiler (most spills)
> 355 cycles for the patched compiler (some spills)
> 398 cycles with -fno-schedule-insns (no spills)
> These were all with "-mcpu=cortex-a15 -O2" but the
> results are similar with other -mcpu options.
> So on GCC's own terms, using its model of the CPU,
> the current mega-spill code seems like a 25% win over
> the spill-free code. That's probably not true in practice,
> but the scheduler can only work within the description
> it's given.
Sorry, forgot to say that all the above was with Wilco's
vdup_n_f32 modification, to work around the arm_neon.h problem.
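
For anyone who wants to see how the three sched2 totals quoted above
compare, here is a minimal sketch of the arithmetic. The cycle counts
are the ones given in the comment; the helper names (cycle_reduction,
speedup) are made up for illustration and are not part of GCC. It only
restates the scheduler's own estimate and says nothing about real
hardware.

```python
# Inner-loop cycle counts taken from the sched2 "total time" figures quoted
# above (-mcpu=cortex-a15 -O2). Nothing here is measured on hardware.
CYCLES = {
    "unpatched (most spills)": 307,
    "patched (some spills)": 355,
    "-fno-schedule-insns (no spills)": 398,
}

# The spill-free build is used as the baseline for the comparison.
BASELINE = CYCLES["-fno-schedule-insns (no spills)"]


def cycle_reduction(cycles, baseline):
    """Fraction of cycles saved relative to the baseline (hypothetical helper)."""
    return 1.0 - cycles / baseline


def speedup(cycles, baseline):
    """How many times faster than the baseline (hypothetical helper)."""
    return baseline / cycles


for name, cycles in CYCLES.items():
    print(
        f"{name:34s} {cycles:4d} cycles  "
        f"{cycle_reduction(cycles, BASELINE):6.1%} fewer  "
        f"{speedup(cycles, BASELINE):.2f}x"
    )
```

On these numbers the unpatched schedule is estimated at about 22.9%
fewer cycles than the spill-free code (roughly 1.30x), and the patched
schedule at about 10.8% fewer (roughly 1.12x).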