C6X has some rather nifty hardware support for modulo scheduling. Consider the following loop:
.L13: ldh .d1t1 *++A5[1], A7 || add .s1 -1, A0, A0 ldh .d2t1 *B5++[1], A8 nop 1 [A0] b .s1 .L13 nop 2 mpy .m1 A8, A7, A9 nop 1 add .d1 A3, A9, A3 ;; condjump to .L13 occurs This takes 9 cycles per iteration. Note that the back branch is not the last insn in the block, it has five cycles worth of delay slots. A fully optimized version of this making use of the loop pipelining hardware looks like this: [ load ILC register some cycles previously ] sploop 1 .L13: ldh .d1t2 *++A5[1], B6 || ldh .d2t1 *B5++[1], A8 nop 4 ;; load to A8 occurs ;; load to B6 occurs mpy .m1x A8, B6, A9 nop 1 ;; multiplication occurs and stores to A9 spkernel 7, 0 || add .s1 A3, A9, A3 The loop is contained between the sploop and spkernel instructions (with one instruction executing in parallel with the spkernel). Instructions are copied to a loop buffer and reexecuted from there, based on the initiation interval which is given as an argument to the sploop instruction. In this case, the II is 1, which means we take one cycle per loop iteration (plus prologue/epilogue costs obviously). Once the loop buffer is full, every iteration of the loop executes in one cycle as follows: ldh .d1t2 *++A5[1], B6 || ldh .d2t1 *B5++[1], A8 || mpy .m1x A8, B6, A9 || add .s1 A3, A9, A3 Note the unit reservations: d1, d2, t1, t2, m1, x1, and s1; no unit is reserved twice. The reason this works is because of the machine's exposed pipeline. It's not only branches that have delay slots, but loads and multiplications as well; really every instruction that takes more than a cycle. The result of a load is ready after five cycles, and in the loop above, it is immediately consumed afterwards. No register value has a lifetime of more than one cycle. This, together with the choice of reservations, allows an optimal initiation interval of 1. (The sploop hardware knows how to deal with interrupts. In general, it's not safe to expose delay slots for anything other than branches outside a sploop/spkernel loop.) I have added support for this to haifa-sched.c. I expect the question "why not use SMS" to come up; there were a number of reasons why I felt that code is unsuitable: There are (or were, when I started) some glaring weaknesses in SMS, such as giving up when the loop contains autoincrement addresses (which is the case in the example above), and by the looks of it fairly poor memory disambiguation compared to sched-ebb with cselib. Correctness is also an issue. The ps_has_conflicts function pretends to verify that it generates a valid schedule, but it ignores half the target hooks, which means it would likely produce incorrect code on C6X. We'd also have to duplicate the delayed-shadow pair mechanism which was added to haifa-sched here as well to cope with load/multiply delay slots. Finally, SMS really runs too early: C6X has an exposed pipeline and requires an exact schedule including unit reservations based on register allocation; it won't do to try to use a schedule that was produced before IRA. Using SMS after register allocation seems tricky at least since it wants to create new pseudos, and using a normal haifa-sched pass later to fix up the schedule may "work" on other targets as it produce a valid straight line schedule (however suboptimal for the original goal of modulo scheduling), but it can only fail on C6X. Making haifa-sched usable for modulo scheduling wasn't actually that hard once I realized that I had already added support for fixed delays between insns, for the C6X delay slot support. All we really need is to extend that to have two different possible reasons for a fixed delay. On top of the existing delay shadow mechanism, we can record a stage. For a given II which we're trying, (II * stage) gives the delay between two insns. The target code unrolls the loop to a factor that seems reasonable and calls record_delay_pair for the new insns. Almost everything else is in place, we just need to accept the possibility that it may not be possible to find a valid schedule for a given II. Backtracking is helpful but must be limited to avoid exponential blow-ups; the param I've chosen seems not to affect performance on any of the benchmarks I've tried and solves the (rare in the first place) compile-time problems I ran across. There will be four patches in the series: 1/4: haifa-sched support 2/4: Make schedule_ebb callable from outside the scheduler 3/4: A hw-doloop bugfix 4/4: C6X pieces. I've tested them together on c6x-elf. They've been in our 4.5 based branch for about half a year or so; we've successfully built kernels, uClibc and applications with it. There'll be some followup patches later to make use of the new regrename-across-loops functionality to produce better unit reservations. Bernd