[0/4] Modulo scheduling with haifa-sched for C6X

Bernd Schmidt Tue, 13 Sep 2011 14:49:51 -0700

C6X has some rather nifty hardware support for modulo scheduling.
Consider the following loop:


.L13:
                ldh     .d1t1   *++A5[1], A7
        ||      add     .s1     -1, A0, A0
                ldh     .d2t1   *B5++[1], A8
                nop     1
        [A0]    b       .s1     .L13
                nop     2
                mpy     .m1     A8, A7, A9
                nop     1
                add     .d1     A3, A9, A3
        ;; condjump to .L13 occurs

This takes 9 cycles per iteration. Note that the back branch is not the
last insn in the block, it has five cycles worth of delay slots.

A fully optimized version of this making use of the loop pipelining
hardware looks like this:

                [ load ILC register some cycles previously ]
                sploop  1
.L13:
                ldh     .d1t2   *++A5[1], B6
        ||      ldh     .d2t1   *B5++[1], A8
                nop     4
        ;; load to A8 occurs
        ;; load to B6 occurs
                mpy     .m1x    A8, B6, A9
                nop     1
        ;; multiplication occurs and stores to A9
                spkernel        7, 0
        ||      add     .s1     A3, A9, A3

The loop is contained between the sploop and spkernel instructions (with
one instruction executing in parallel with the spkernel). Instructions
are copied to a loop buffer and reexecuted from there, based on the
initiation interval which is given as an argument to the sploop
instruction. In this case, the II is 1, which means we take one cycle
per loop iteration (plus prologue/epilogue costs obviously). Once the
loop buffer is full, every iteration of the loop executes in one cycle
as follows:

                ldh     .d1t2   *++A5[1], B6
        ||      ldh     .d2t1   *B5++[1], A8
        ||      mpy     .m1x    A8, B6, A9
        ||      add     .s1     A3, A9, A3

Note the unit reservations: d1, d2, t1, t2, m1, x1, and s1; no unit is
reserved twice.

The reason this works is because of the machine's exposed pipeline. It's
not only branches that have delay slots, but loads and multiplications
as well; really every instruction that takes more than a cycle. The
result of a load is ready after five cycles, and in the loop above, it
is immediately consumed afterwards. No register value has a lifetime of
more than one cycle. This, together with the choice of reservations,
allows an optimal initiation interval of 1. (The sploop hardware knows
how to deal with interrupts. In general, it's not safe to expose delay
slots for anything other than branches outside a sploop/spkernel loop.)

I have added support for this to haifa-sched.c. I expect the question
"why not use SMS" to come up; there were a number of reasons why I felt
that code is unsuitable:

There are (or were, when I started) some glaring weaknesses in SMS, such
as giving up when the loop contains autoincrement addresses (which is
the case in the example above), and by the looks of it fairly poor
memory disambiguation compared to sched-ebb with cselib.

Correctness is also an issue. The ps_has_conflicts function pretends to
verify that it generates a valid schedule, but it ignores half the
target hooks, which means it would likely produce incorrect code on C6X.
 We'd also have to duplicate the delayed-shadow pair mechanism which was
added to haifa-sched here as well to cope with load/multiply delay slots.

Finally, SMS really runs too early: C6X has an exposed pipeline and
requires an exact schedule including unit reservations based on register
allocation; it won't do to try to use a schedule that was produced
before IRA. Using SMS after register allocation seems tricky at least
since it wants to create new pseudos, and using a normal haifa-sched
pass later to fix up the schedule may "work" on other targets as it
produce a valid straight line schedule (however suboptimal for the
original goal of modulo scheduling), but it can only fail on C6X.

Making haifa-sched usable for modulo scheduling wasn't actually that
hard once I realized that I had already added support for fixed delays
between insns, for the C6X delay slot support. All we really need is to
extend that to have two different possible reasons for a fixed delay. On
top of the existing delay shadow mechanism, we can record a stage. For a
given II which we're trying, (II * stage) gives the delay between two
insns. The target code unrolls the loop to a factor that seems
reasonable and calls record_delay_pair for the new insns. Almost
everything else is in place, we just need to accept the possibility that
it may not be possible to find a valid schedule for a given II.

Backtracking is helpful but must be limited to avoid exponential
blow-ups; the param I've chosen seems not to affect performance on any
of the benchmarks I've tried and solves the (rare in the first place)
compile-time problems I ran across.

There will be four patches in the series:
1/4: haifa-sched support
2/4: Make schedule_ebb callable from outside the scheduler
3/4: A hw-doloop bugfix
4/4: C6X pieces.

I've tested them together on c6x-elf. They've been in our 4.5 based
branch for about half a year or so; we've successfully built kernels,
uClibc and applications with it. There'll be some followup patches later
to make use of the new regrename-across-loops functionality to produce
better unit reservations.


Bernd

[0/4] Modulo scheduling with haifa-sched for C6X

Reply via email to