Summary
-------
My OCaml-hosted Bicicleta interpreter is now to the point that it can
run programs with integers and booleans, but it's about 8000 times
slower than the OCaml interpreter at doing simple arithmetic, and it
leaks huge quantities of space, apparently a few dozen bytes per
method call in my microbenchmark. I need to come up with a solution to the
excessive space usage if the system is going to be usable, even for
experimentation, and I may need to make it faster.
The interpreter is under 400 lines of OCaml, not counting unit tests
and some 250 lines of code in the Bicicleta language itself.
Without Attribute Caching
-------------------------
So I thought I'd try the dumb fib microbenchmark to get a vague idea
of how slow the current, tree-reduction-based OCaml implementation of
the Bicicleta language is. Here's the OCaml version:
let rec fib n = if n < 2 then 1 else fib (n-1) + fib (n-2) in
print_int (fib 32) ; print_newline() ;;
Byte-compiled, in PowerPC emulation, that runs at about 5 million
base-case recursions per second. (Since it adds together all the
values produced by the base-case recursions, and all those values are
1, the return value is the number of base-case recursions. The total
number of fib calls is one less than twice the number of base-case
recursions.)
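That relationship is easy to check by instrumenting the OCaml version
with a call counter; the counter below is my addition for checking the
arithmetic, not part of the timed benchmark:

```ocaml
(* Instrumented variant of the benchmark above; the call counter is an
   addition for checking the arithmetic, not part of the timed code. *)
let calls = ref 0

let rec fib n =
  incr calls;
  if n < 2 then 1 else fib (n - 1) + fib (n - 2)

let () =
  let result = fib 16 in
  (* fib 16 yields 1597 base-case recursions, so 2*1597 - 1 = 3193 calls *)
  Printf.printf "fib 16 = %d in %d calls\n" result !calls
```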
Here's a version in Bicicleta's language:
{env: fib = {fib: arg1 = 3
             '()' = (fib.arg1 < 2).if_true(then = 1,
                        else = env.fib(fib.arg1-1) + env.fib(fib.arg1-2))}
}.fib(16)
That runs at about 120-140 base-case recursions per second, running
on top of the OCaml implementation mentioned earlier, and it seems
to take time roughly linear in the number of base-case recursions.
That's slower by about a factor of 44000 than its host interpreter,
which is a lot. It's still fast enough that you could use it for
small experiments. There's some headroom (about a factor of 10 or 20)
above the OCaml implementation that I can probably take advantage of
by using ocamlopt and not doing CPU emulation.
The basic plan was as follows:
- spend as little effort as possible to write a very minimal
implementation in some language with an implementation that already
runs, providing as few primitives as possible. The current
implementation is 372 non-blank non-comment non-unit-test lines of
OCaml, ocamllex, and ocamlyacc, and it can now run "hello world"
style programs like the one above. It's still missing some
features, especially introspective and imperative ones. This has
taken me two weeks, so I'm glad I didn't start out with a bigger
piece of the project.
- write an IDE for the language in itself. This doesn't have to be
anything fancy, but it needs to be enough that I start to experience
the benefits supposedly accruing to the spreadsheet-style UI enabled
by the language semantics.
- write a compiler in Bicicleta for itself, so that we can generate C
or JavaScript from some version of the Bicicleta metacircular
interpreter.
- compile a native-code version of the IDE and compiler using the
metacompiler. Ideally this version will support dynamic compilation
to machine code without restarts, reducing the performance problems
inherent in the language's difficult semantics to a tolerable level,
allowing focus on the next task:
- polishing the IDE and tailoring it to particular application areas.
But it looks like maybe things are slow enough that I'd better put in
a little more work on the first step before proceeding any further. I
tossed in a few lines of code to count method calls by name, and
here's what I found, on exactly the code above, which returns 1597:
sys: 326595
arg1: 313819
(): 252596
userdata: 140008
object: 110994
normal_commutative_number: 106203
native_integer: 106203
intrinsics: 106203
new: 66013
result: 36997
other: 36997
op: 36997
coerce: 36997
binop: 36997
as_integer: 36997
arg2: 36997
add: 36997
+: 36997
!!: 36997
integer_add: 33804
subtract: 32208
negated: 32208
integer_negated: 32208
-: 32208
true: 6386
if_true: 4789
less_than: 3193
integer_less_than: 3193
fib: 3193
bool: 3193
<: 3193
false: 3192
then: 1597
else: 1596
native_string: 2
show: 1
integer_show: 1
total: 2094769
So we're only calling fib 3193 times, which seems about right. But we
end up calling prog.sys 326595 times and various arg1 things 313819
times, and so on, and doing about 2.1 million calls in all. It's a
little strange that every call to 'fib' involves almost 12 calls to
'+'.
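The counting itself takes only a few lines. Here's a plausible sketch
of what they might look like; the names `counts` and `count_call` are
mine, not the interpreter's actual code:

```ocaml
(* Hypothetical sketch of per-method-name call counting; the names
   (counts, count_call, report) are illustrative, not the interpreter's. *)
let counts : (string, int) Hashtbl.t = Hashtbl.create 64

let count_call name =
  let n = try Hashtbl.find counts name with Not_found -> 0 in
  Hashtbl.replace counts name (n + 1)

(* At exit, dump the counts sorted by frequency, as in the tables above. *)
let report () =
  let entries = Hashtbl.fold (fun k v acc -> (k, v) :: acc) counts [] in
  let sorted = List.sort (fun (_, a) (_, b) -> compare b a) entries in
  List.iter (fun (k, v) -> Printf.printf "%s: %d\n" k v) sorted;
  Printf.printf "total: %d\n"
    (List.fold_left (fun acc (_, v) -> acc + v) 0 entries)
```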
With Attribute Caching
----------------------
So this suggests that we can get a substantial speedup already by
caching object attributes, converting tree reduction into graph
reduction: maybe a factor of 100?
Well, I added the code to cache object attributes, and here were the
results:
arg1: 52675
(): 49484
userdata: 23944
sys: 19162
intrinsics: 19155
native_integer: 15963
result: 7981
other: 7981
op: 7981
new: 7981
coerce: 7981
as_integer: 7981
arg2: 7981
add: 7981
!!: 7981
+: 6385
binop: 4789
integer_add: 4788
true: 3196
if_true: 3194
less_than: 3193
integer_less_than: 3193
fib: 3193
bool: 3193
<: 3193
subtract: 3192
negated: 3192
integer_negated: 3192
false: 3192
-: 3192
then: 1597
else: 1596
object: 3
native_string: 2
show: 1
normal_commutative_number: 1
integer_show: 1
total: 309690
That's about a factor of 6.8 improvement on this microbenchmark, which
is noticeable but still not that great. This adds up to:
- 16.5 accesses to arg1;
- 15.5 calls to '()';
- 2.5 calls to '!!', add, as_integer, coerce, new, op, other, and
result;
- 2 calls to '+';
- and 1 call to '-'
per call to fib.
And it runs considerably faster, at 650 base-case recursions per
second, about 5 times faster. A lot less than the factor of 100 I
hoped for! But not bad for adding the 12 lines of code to cache the
results.
Some form of this optimization is absolutely necessary for any larger
program.
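I haven't shown the interpreter's actual value representation in this
note, so the following is only a minimal sketch of the idea, assuming
an object can be modeled as a lookup function plus a per-object table:

```ocaml
(* Hedged sketch, not the interpreter's real representation: model an
   object as a lookup function (the tree-reduction path) plus a cache
   of already-computed attributes (the graph-reduction shortcut). *)
type obj = {
  lookup : string -> obj;               (* full tree-reduction lookup *)
  cache : (string, obj) Hashtbl.t;      (* memoized attribute values  *)
}

let get obj name =
  try Hashtbl.find obj.cache name
  with Not_found ->
    let v = obj.lookup name in
    Hashtbl.add obj.cache name v;       (* each attribute reduced once *)
    v
```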
Memory Usage With Attribute Caching
-----------------------------------
However, it also took 100 MB of RAM to calculate fib(20). Another six
lines of code to allocate caches only for objects that use them
cuts it down to 70MB (and speeds it up very slightly), but 70MB is
still way too much --- that's for about 2.1 million calls. It runs in
more or less constant space (hovering around 3.3MB for four minutes)
if I clear the cache immediately after putting each item into it,
which of course makes it run very slowly again, but it makes it clear
that it's the caches that are the problem and not, say, a lack of
tail-recursion.
The size of the problem suggests that it's hanging on to some amount
of stuff from every one of those 2.1 million calls --- not just the 20
that might be on the call stack at any one time. This surprised me,
because I would expect expressions like the env.fib{fib.arg1-2} in
env.fib{fib.arg1-2}.'()' to become garbage fairly quickly --- it's an
anonymous class produced by inheriting from env.fib, and the only
thing we hold on to from it is its '()'.
Since this is just a cache-management problem, coming up with a
version that doesn't break correctness is easy, but getting the
performance characteristics right could be very tricky.
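As one assumed policy, which is purely my speculation and not
something the interpreter does yet: flush a cache wholesale whenever
it passes a size bound, trading some recomputation for bounded space:

```ocaml
(* Assumed policy sketch (not what the interpreter does yet): cap each
   attribute cache and flush it wholesale when it grows past the cap,
   bounding space at the cost of occasional recomputation. *)
let cache_limit = 1024

let cache_put cache name value =
  if Hashtbl.length cache >= cache_limit then Hashtbl.reset cache;
  Hashtbl.replace cache name value
```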
Slowness Is Partly Due To Indirection
-------------------------------------
It's still about 8000 times slower than OCaml.
Looking at it a slightly different way, it's executing about 92000
method calls per second in order to perform the 650 base-case
recursions or 1300 calls to "fib" per second, while the OCaml and
other versions need only execute one function-call and return per call
to "fib". Considered that way, it's only about 110 times slower than
its host OCaml, which is pretty much what you'd expect for an
interpreter. It's just that building each call to "fib" out of 70
lower-level method calls ends up costing an additional factor of 70 in
performance.
So if, for example, you took out some of the library magic to support
multimode arithmetic and double dispatch and the very small primitive
set, it could get a lot faster. Compare --- fib(20) on the normal
version:
arg1: 361192
(): 339303
userdata: 164179
sys: 131350
intrinsics: 131343
native_integer: 109453
result: 54726
other: 54726
op: 54726
new: 54726
coerce: 54726
as_integer: 54726
arg2: 54726
add: 54726
!!: 54726
+: 43781
binop: 32836
integer_add: 32835
true: 21894
if_true: 21892
less_than: 21891
integer_less_than: 21891
fib: 21891
bool: 21891
<: 21891
subtract: 21890
negated: 21890
integer_negated: 21890
false: 21890
-: 21890
then: 10946
else: 10945
object: 3
native_string: 2
show: 1
normal_commutative_number: 1
integer_show: 1
total: 2123396
real 0m23.360s
user 0m23.084s
sys 0m0.252s
And a version where '+', '-', and '<' just go straight to a primitive
without any further layers of indirection, and we have a '-'
primitive:
arg1: 186070
(): 131345
sys: 109460
userdata: 109454
native_integer: 87563
arg2: 54726
true: 32840
new: 32836
false: 32835
if_true: 21892
integer_less_than: 21891
fib: 21891
bool: 21891
<: 21891
integer_subtract: 21890
-: 21890
then: 10946
integer_add: 10945
else: 10945
+: 10945
object: 3
native_string: 2
show: 1
normal_commutative_number: 1
intrinsics: 1
integer_show: 1
total: 974155
real 0m8.908s
user 0m8.764s
sys 0m0.131s
The normal version has 2.18 times as many calls and takes 2.62 times
as much time; that makes this version only about 3000 times as slow as
OCaml. So this suggests that more symbolic work should suffer
somewhat less of a speed penalty than the fib microbenchmark
suggests. Also, although I feel a little silly suggesting it at this
point given the gross inefficiency of the interpreter in general, that's
the kind of cost that polymorphic inline caches and the like can help
a lot with.
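For illustration only, here is a generic monomorphic inline cache in
OCaml; nothing below is from the interpreter, it just shows the shape
of the trick: each call site remembers the last receiver tag and the
method it resolved to, skipping the full lookup when the tag matches.

```ocaml
(* Generic monomorphic inline-cache sketch, not Bicicleta code: a call
   site caches the last (receiver tag, resolved method) pair. *)
type 'v call_site = { mutable cached : (string * 'v) option }

let send site ~tag ~lookup =
  match site.cached with
  | Some (t, m) when t = tag -> m        (* cache hit: skip the lookup *)
  | _ ->
      let m = lookup tag in              (* slow path: full resolution *)
      site.cached <- Some (tag, m);
      m
```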
Could It Be Fast Enough Already?
--------------------------------
I'm kind of thinking that 1300 calls per second, plus a hypothetical
10x speedup from compiling OCaml to native code, makes it about the
same speed as the Bourne shell, far slower than Tcl. I wouldn't
really want to develop a compiler written in the Bourne shell, but
that has more to do with the semantics of the language than with its
performance.
Here's a dumb fib in bash, coded to avoid forking:
#!/bin/bash
fib () {
    if [[ $1 -lt 2 ]] ; then rv=1 ; else
        fib $(( $1 - 1 )); set $1 $rv
        fib $(( $1 - 2 )); rv=$(( $rv + $2 ))
    fi
}
fib 25
echo $rv
The bash version runs at about 7800 base cases per second; we can
expect the Bicicleta one to run at 6500 if the speedup is exactly 10x.
Vector Operations?
------------------
If I implement some APL-style vector operations as primitives, which
would probably be another 50-100 lines of OCaml, then Bicicleta
programs will get essentially native speed on vectorizable operations.
That would be great for interactive image processing and 3-D graphics,
but I'm not sure it would help with writing a Bicicleta metacompiler.
It also has the drawback of demanding that many more primitives from
future implementations of the language. Right now, I'm not even
using an integer subtraction primitive --- I'm using the negation and
addition primitives instead. In theory, I should be able to get all of
integer comparison and arithmetic in exchange for just add, negated,
multiply, divmod, power, and less-than.
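That reduction is straightforward to sketch, with OCaml standing in
for the Bicicleta library code:

```ocaml
(* Sketch, with OCaml standing in for Bicicleta library code: derive
   subtraction and the remaining comparisons from just the add,
   negated, and less_than primitives, the way the current library
   avoids a subtract primitive. *)
let add a b = a + b              (* primitive *)
let negated a = -a               (* primitive *)
let less_than a b = a < b        (* primitive *)

let subtract a b = add a (negated b)
let greater_than a b = less_than b a
let less_or_equal a b = not (less_than b a)
let equal a b = not (less_than a b) && not (less_than b a)
```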