[Milkymist-devel] improved PFPU scheduler

Werner Almesberger Fri, 23 Sep 2011 18:58:27 -0700

When discussing the long time that passes between starting a
Milkymist and the appearance of the first patch, patch compilation
was highlighted as one source of trouble.


I had a look at the compiler and found that the scheduler, the
program that places the GFPU instructions in a suitable order and
performs register allocations, uses a very clear and understandable
but unfortunately also rather inefficient algorithm.

I've now written a drop-in replacement for gfpus.c that is faster
and that also produces a more efficient schedule most of the time.

My scheduler is here:

http://projects.qi-hardware.com/index.php/p/wernermisc/source/tree/master/m1/perf/sched.c

It has an optional optimization capability that takes a bit of CPU
time and usually produces better code than the version without
optimization. What it does is that it tries to prefer operations
that have a long chain of further operations depend on them, so
that we have a better chance of having a supply of operations to
fill delay slots.

I've tested that it generates code that is functionally equivalent
to the original by building a minimum patch compilation environment
on my x86-64 PC and comparing the content of the fixed registers at
the end of the code sequence. Here's the script that reads the
COMP_DEBUG output and determines what a code sequence does:

http://projects.qi-hardware.com/index.php/p/wernermisc/source/tree/master/m1/perf/eval.pl

Here is a performance comparison between the original, the new
scheduler without optimization, and the new scheduler with
optimization (on a Q6600, without -O, with -pg):

http://downloads.qi-hardware.com/people/werner/m1/perf/chart-20110924.html

"Time" is the run time for 10'000 compilation cycles. "Size" is the
size of the instruction sequence, "Eff" the efficiency respective to
the idealized FPVM code (see 6.4 section of Sebastien's thesis for
details), "Regs" the number of PFPU registers allocated, "Equiv" the
check for functional equivalence, and "Name" a sanitized version of
the patch file's name.

On the M1, I've obtained the following run times for compiling all
the patches but the last one (it was also compiled, but I'm too lazy
to get the time for it ;-):

Original          10.5 s
New scheduler      4.0 s
With optimization  4.1 s

The difference is less dramatical than in the performance chart.
I suspect that the parsing overhead could play a bigger role on the
M1 than in my test environment. The parser also offers rich
potential for optimizations.

With my scheduler, the total time between starting the M1 and the
first patch drops from about 35 seconds to about 29 seconds. Quite a
lot of time (~16 s) seems to be spent waiting for BOOTP. I'm using
Flickernoise stable_1.0. Not sure if the BOOTP wait has been fixed
in later versions.

- Werner
_______________________________________________
http://lists.milkymist.org/listinfo.cgi/devel-milkymist.org
IRC: #milkymist@Freenode

[Milkymist-devel] improved PFPU scheduler

Reply via email to