I don't remember if this has already been experimented with; I've
lost track of what the previous sets of benchmarks were for. I was
curious to see how much faster a single-function, switch-based interpreter
would go. I took process_opfunc.pl and created process_opfunc_switch.pl,
which does the work. I added a few additional optimizations, such as
copying many of the interpreter variables into local variables (to
reduce the indirection by one level). I then redefined the register access
macros locally to account for this. I also munged the "RETURN" keyword to
directly increment the code pointer. The only problem I encountered was
that the existing parser wasn't designed to adapt well to the new "nested
code", so the comments outside the functional blocks (of
basic_opcodes.ops) are no longer near their respective code. (I also
identified a bug in process_opfunc.pl, which I'll post patches for later.)
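For anyone who hasn't looked at the generated output, the core of the
switch version looks roughly like this. This is a hand-written sketch,
not the actual output of process_opfunc_switch.pl; the opcode numbers,
struct layout, and names are invented for illustration:

    typedef long opcode_t;

    struct interp {
        long   int_reg[32];
        double num_reg[32];
        /* ... string/PMC registers, stacks, etc. ... */
    };

    opcode_t *
    switch_core(opcode_t *code, struct interp *interp)
    {
        for (;;) {
            switch (*code) {
            case 0:                            /* end */
                return code;
            case 1:                            /* set_i_ic: Ix = constant */
                interp->int_reg[code[1]] = code[2];
                code += 3;                     /* "RETURN" becomes a plain pointer bump */
                break;
            case 2:                            /* inc_i: Ix++ */
                interp->int_reg[code[1]]++;
                code += 2;
                break;
            /* ... one case per op, pasted in from the *.ops source ... */
            default:
                return code;
            }
        }
    }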
Using a 466 MHz dual Celeron and a 600M-instruction loop (bench.jako with
different limits):
    Stock code,   default compiler options     8.2M ops/s
    Stock code,   -O3 with the works          13.5M ops/s
    Switch-based, default compiler options    11.5M ops/s
    Switch-based, -O3 with the works          28.9M ops/s

    # for reference, a perl -e '...' loop with equivalent code was also run
    perl 5.6 (Red Hat build, -O2)              4.2M ops/s
I actually ran the code both with and without the bastardized local copies
of INT_REG and NUM_REG. With default compiler options this made a difference
of several Mops/s; with -O3 the time difference was negligible.
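The local-copy variant hoists the register files into locals at the top of
the core function and redefines the access macros to use them, roughly like
this (again a sketch; the real INT_REG/NUM_REG definitions in the tree may
well differ):

    #undef  INT_REG
    #undef  NUM_REG
    #define INT_REG(n) (int_reg[n])      /* instead of reaching through interp */
    #define NUM_REG(n) (num_reg[n])

    opcode_t *
    switch_core_local(opcode_t *code, struct interp *interp)
    {
        long   *int_reg = interp->int_reg;   /* one less level of indirection */
        double *num_reg = interp->num_reg;

        for (;;) {
            switch (*code) {
            case 2:                          /* inc_i: Ix++ */
                INT_REG(code[1])++;
                code += 2;
                break;
            /* ... remaining ops as before, using INT_REG()/NUM_REG() ... */
            default:
                return code;
            }
        }
    }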
One additional optimization that I'm going to try to make: the *.ops
code is not ordered with respect to the op-code id, so the switch
statement couldn't be optimized by gcc (it just created a local
jump table, which is essentially the same as vtable[*code] minus the minor
overhead of subroutine stack operations, which occur in parallel). I'm
going to redesign the compiler so that the switch statement is
numerically ordered; I believe this will let gcc emit a direct PC + *code
jump. Lastly, the highly tuned compiler decided to litter the code
with branch statements. I'm guessing it was trying to maximize code reuse
(a possible reduction in cache footprint?), though this thwarts much of the
point of having the code blocks close to each other.
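For comparison, the stock core dispatches one C function per op through a
function-pointer table; something like the following (names invented; this
is the vtable[*code] loop I'm referring to, where every op pays for an
indirect call and return on top of the indexed lookup):

    typedef opcode_t *(*op_func_t)(opcode_t *code, struct interp *interp);

    extern op_func_t op_func_table[];        /* one entry per op-code id */

    void
    run_ops(opcode_t *code, struct interp *interp)
    {
        /* each iteration: indexed lookup + indirect call + return;
         * the "end" op's function returns NULL to stop the loop */
        while (code)
            code = op_func_table[*code](code, interp);
    }

With the case labels numerically ordered, the switch ought to keep the same
indexed lookup but drop the call/return, which is the direct PC + *code jump
I'm hoping for.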
The end result is that the performance gain is not spectacular (at least
with this minimally cached Celeron). The test times ranged from 140s down to
23s, with the faster configurations only tens of seconds apart. There was a
variance of two or more seconds across consecutive executions, so this isn't
definitive.
I'm away from my usual internet connection so it's inconvenient to post
patches and files. If anyone really wants to see them, I'll make a
special effort.
-Michael