So I've been prototyping a Bicicleta interpreter in OCaml, and
generally the experience has been great.  I'd never used lex or yacc,
but I know most of the theory, so given the reference-manual
documentation for ocamllex and ocamlyacc, I was able to put together
useful parsers pretty quickly.

And OCaml is nice and fast.  For most things, fast doesn't matter all
that much any more; but a language interpreter, by its nature, imposes
some significant slowdown on the language it's interpreting, usually
from one to three orders of magnitude.  So implementing your language
interpreter in, say, Python, may not be a good idea --- the two orders
of magnitude slowdown imposed by Python are two orders of magnitude
that get imposed on every program running in your interpreter, on top
of the 1-3 your interpreter imposes.  Multiply it all out, and you can
end up iterating through loops thousands of times per second, instead
of hundreds of millions.

OCaml really seems to be optimized for language implementations:
pattern-matching, the facility it calls "variants", ocamllex and
ocamlyacc, garbage collection, and static type-checking all make a lot
of sense for compilers.

And you can compile and deliver a binary.

But I'm being tempted by SBCL!

The Microbenchmark (OCaml, Python, Perl, Ruby, Elisp, Tcl)
----------------------------------------------------------

There's a microbenchmark I like to run on language implementations;
here's the OCaml version:

    let rec fib n = if n < 2 then 1 else fib (n-1) + fib (n-2) in
    print_int (fib 32) ; print_newline() ;;

This computes a Fibonacci number in a pretty slow way, recursing into
the base case N times in order to compute the number N.  On Bea's
laptop, one of the new Intel MacBooks, it runs at about 3.1 million
base-case recursions per second in the OCaml interpreter, or 5.2
million per second if I ocamlc it first.  This may be a little unfair
to OCaml, since I'm running a PowerPC version on this machine, through
transparent binary translation.
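
A note on how I compute these figures: since this fib returns 1 from
every base case and merely adds the recursive results together, the
number it returns is exactly the number of base-case recursions, so
the rate is just fib(n) divided by the wall-clock time.  Here's a
sketch of the sort of harness I mean, in Python (the timing details
don't matter much):

    import time

    def fib(n):                   # the same fib as in the Python version below
        if n < 2: return 1
        return fib(n-1) + fib(n-2)

    start = time.time()
    result = fib(32)              # result == number of base-case recursions
    elapsed = time.time() - start
    print result / elapsed, 'base-case recursions per second'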

But the Python version runs at 0.73 million base-case recursions per
second:

    def fib(n):
        if n < 2: return 1
        return fib(n-1) + fib(n-2)
    print fib(32)

That's Python 2.3, compiled for Intel.  The Perl version runs at the
same speed:

    #!/usr/bin/perl -w
    use strict;
    sub fib { $_[0] < 2 ? 1 : fib($_[0] - 1) + fib($_[0] - 2) }
    print fib(32), "\n";

So does Ruby (well, actually, 0.75 million):

    #!/usr/bin/ruby
    def fib(n)
      if n < 2 then 1 else fib(n-1) + fib(n-2) end
    end
    print fib(32), "\n"

Byte-compiled elisp is a bit slower, at 0.46 million per second, timed
with this:

    (let ((start (current-time))) (setq fib30 (fib 30)) (time-since start))

(see below in the SBCL section for the Lisp code for fib).  Tcl is
even worse, at 0.083 million per second:

    proc fib {x} { if {$x < 2} { return 1 } { 
        return [expr [fib [expr $x - 1]] + [fib [expr $x - 2]]] 
    } }
    puts [fib 28]

So OCaml really isn't doing that badly!  Especially considering it's
running on an emulated CPU.

I probably should have included JavaScript here (since it is, after
all, The Next Mainstream Programming Language) but I didn't.  My
experience is that, in both Firefox 1.5 and Safari, it's usually in
about the same speed range as Python, Perl, Ruby, and Elisp.

SBCL
----

But then I tried it in (Intel) SBCL:

    (defun fib (n) (if (< n 2) 1 (+ (fib (- n 1)) (fib (- n 2)))))
    (print (fib 43))

And it ran, in the interactive interpreter, at 19 million per second,
four times as fast as OCaml.  SBCL, it turns out, immediately compiles
everything to native code, even in the interactive interpreter.
Compiling the above function takes 12 milliseconds.

SBCL also lets you do (disassemble 'fib), which I did, and I saw that
this version was still calling out to do generic arithmetic and
comparison operations, even though SBCL's compiler does some type
inference.  So I whacked at it a bit until it didn't do that any more:

    (defun fib (n) (if (< (the fixnum n) 2) 
                       1 
                     (+ (the fixnum (fib (- n 1))) 
                        (the fixnum (fib (- n 2))))))
    (print (fib 43))

This doesn't omit the type-checking entirely, but it omits the dynamic
dispatch of the numeric operations, and consequently it ran more than
twice as fast, at 47 million iterations per second, 9 times as fast as
OCaml.  I haven't figured out how to turn off the run-time
type-checking on each call entirely, which I think might trim it from
91 instructions to 20 or so.

The main body of those instructions, with added comments, is as follows:

115B0F9A:       8B55F0           MOV EDX, [EBP-16]  ; no-arg-parsing entry point
    0F9D:       F6C203           TEST DL, 3   ; examine low two bits (tag)
    0FA0:       0F858E000000     JNE L4       ; should be 00 for fixnum
    0FA6:       83FA08           CMP EDX, 8   ; 8 is fixnum 2 (<<2)
    0FA9:       7C70             JL L3        ; so if it's less, we go to L3
    ; This next instruction is unnecessary because none of the previous four
    ; instructions modify the registers!
    0FAB:       8B55F0           MOV EDX, [EBP-16]  ; fetch the argument again
    0FAE:       83EA04           SUB EDX, 4   ; subtract fixnum 1
    ; Here's the call sequence for FIB:
    0FB1:       8BDC             MOV EBX, ESP ; save old stack pointer
    0FB3:       83EC0C           SUB ESP, 12  ; allocate three words
    ; Load function pointer from #<FDEFINITION object for FIB>
    0FB6:       8B05340F5B11     MOV EAX, [#x115B0F34]
    0FBC:       B904000000       MOV ECX, 4   ; load fixnum 1
    0FC1:       896BFC           MOV [EBX-4], EBP  ; save old base pointer
    0FC4:       8BEB             MOV EBP, EBX ; stick old stack pointer in EBP
    0FC6:       FF5005           CALL DWORD PTR [EAX+5] ; recursive call
    0FC9:       7302             JNB L0       ; skip next instruction ...?
    0FCB:       8BE3             MOV ESP, EBX ; restore old stack pointer
    0FCD: L0:   F6C203           TEST DL, 3   
    0FD0:       7568             JNE L5       ; type-tag test on returned value
    0FD2:       8955F4           MOV [EBP-12], EDX  ; save returned value
    0FD5:       8B55F0           MOV EDX, [EBP-16]  ; fetch the argument again
    0FD8:       83EA08           SUB EDX, 8         ; subtract fixnum 2
    ; same call/return sequence again:
    0FDB:       8BDC             MOV EBX, ESP
    0FDD:       83EC0C           SUB ESP, 12
    0FE0:       8B05340F5B11     MOV EAX, [#x115B0F34]
                ; #<FDEFINITION object for FIB>
    0FE6:       B904000000       MOV ECX, 4
    0FEB:       896BFC           MOV [EBX-4], EBP
    0FEE:       8BEB             MOV EBP, EBX
    0FF0:       FF5005           CALL DWORD PTR [EAX+5]
    0FF3:       7302             JNB L1
    0FF5:       8BE3             MOV ESP, EBX
    ; End of call-return sequence.  Whew!
    0FF7: L1:   F6C203           TEST DL, 3 ; type-tag test, again
    0FFA:       7544             JNE L6
    0FFC:       8B45F4           MOV EAX, [EBP-12] ; load other returned value
    ; This next sequence does the fixnum arithmetic in full 32-bit
    ; registers instead of the 30-bit fixnum form for some reason.
    0FFF:       C1F802           SAR EAX, 2    ; shift arithmetic right 2 bits
    1002:       C1FA02           SAR EDX, 2    ; on both
    1005:       01D0             ADD EAX, EDX  ; DO THE ADDITION!
    1007:       8BD0             MOV EDX, EAX  ; make useless copy of result
    1009:       D1E2             SHL EDX, 1    ; shift left 1 bit
    100B:       7039             JO L7         ; jump if overflow
    100D:       D1E2             SHL EDX, 1
    100F:       7035             JO L7

    ; End of addition sequence.  For positive fixnums, this would have
    ; been better and exactly equivalent (except that it doesn't
    ; clobber EAX):
    ; ADD EDX, EAX
    ; JC L7
    ; but I think that won't work for negative fixnums.

    ; At this point, we have the return value in EDX, and we're ready to return.
    1011: L2:   8D65F8           LEA ESP, [EBP-8] ; pop stack frame
    1014:       F8               CLC              ; clear carry flag (for JNB?)
    1015:       8B6DFC           MOV EBP, [EBP-4] ; restore old frame pointer?
    1018:       C20400           RET 4
    101B: L3:   BA04000000       MOV EDX, 4  ; fixnum 1 if n<2
    1020:       EBEF             JMP L2      ; ret val fixnum 1 is in EDX...

There are another 44 instructions having to do mostly with exception
and error handling: if the argument n wasn't a fixnum (L4), if the
first recursion returned a non-fixnum (L5), if the second recursion
returned a non-fixnum (L6), if the addition overflowed to a bignum
(L7); and there are some NOPs and dead code for handling invalid arg
counts.

Presumably OCaml programs would run at that speed too, or faster, if I
compiled them to native code, but the native-code compiler needs an
assembler, and installing an assembler under MacOS X apparently
requires 3+ gigabytes of disk space.  ocamlopt, for its part, compiles
the fib routine in this microbenchmark to 24 PowerPC instructions,
which, if I had an assembler, I could benchmark.  (It might be more to
the point to test the Intel OCaml.)

Here's the ocamlopt PowerPC assembly; notice that it is much simpler:

    _camlFib__fib_57:
            mflr        r0
            addi        r1, r1, -16
            stw r0, 12(r1)
    L101:
            cmpwi       r3, 5
            bge L100
            lwz r11, 12(r1)
            mtlr        r11
            li  r3, 3
            addi        r1, r1, 16
            blr
    L100:
            stw r3, 0(r1)
            addi        r3, r3, -4
    L102:
            bl  _camlFib__fib_57
            lwz r4, 0(r1)
            stw r3, 4(r1)
            addi        r3, r4, -2
    L103:
            bl  _camlFib__fib_57
            lwz r11, 12(r1)
            mtlr        r11
            lwz r4, 4(r1)
            add r7, r3, r4
            addi        r3, r7, -1
            addi        r1, r1, 16
            blr

And I got to thinking.  

Dynamic Code Generation
-----------------------

If I write an implementation in OCaml, I get all of OCaml's shiny
machinery for interpreting or compiling, but as far as I can tell,
there is no "eval".  There's the Dynlink module, which lets bytecode
programs dynamically load bytecode files, so if I compiled stuff into
OCaml, I could dynamically load the result (if I forgo native-code
compilation); but the OCaml compiler isn't actually all that fast, so
it might cause long pauses.

But if I write an implementation in SBCL, I can dynamically compile
code into a running SBCL process very easily, like with two lines of
code:

    This is SBCL 0.9.15, an implementation of ANSI Common Lisp.
    ...
    * (defun return-const-fun (name const) `(defun ,name () ,const))
    RETURN-CONST-FUN
    * (eval (return-const-fun 'foobar 40))
    FOOBAR
    * (disassemble 'foobar)
    ; 11608A1A:       BAA0000000       MOV EDX, 160 ; no-arg-parsing entry point
    ;       1F:       8D65F8           LEA ESP, [EBP-8]
    ;       22:       F8               CLC
    ;       23:       8B6DFC           MOV EBP, [EBP-4]
    ;       26:       C20400           RET 4
    ... a bunch of nops and boilerplate omitted ...

The first line defines a function that generates code; the second one
invokes it and compiles the result into, as it turns out, five machine
instructions.  (160 is 40, shifted left by two bits.)
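
That shift is SBCL's fixnum representation at work: a fixnum n is
stored as n shifted left by two bits, leaving the low two bits as a
type tag of zero, which is what the TEST DL, 3 instructions in the fib
disassembly were checking.  In Python terms (the function names here
are mine, just for illustration):

    def tag_fixnum(n): return n << 2            # 40 becomes 160
    def fixnum_p(word): return word & 3 == 0    # the TEST DL, 3 check
    def untag_fixnum(word): return word >> 2    # the SAR EAX, 2 earlier

    print tag_fixnum(40), fixnum_p(160), untag_fixnum(160)   # 160 True 40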

And I have the impression that generating code in Common Lisp will be
easier than generating code in OCaml, C, or assembler.

Java
----

Oh, yeah, Java is even faster:

    class Fib {
        public static int fib(int n) {
            if (n < 2) return 1;
            return fib(n-1) + fib(n-2);
        }
        public static void main(String[] argv) {
            System.out.println(new Integer(fib(40)).toString());
        }
    }

That returns 165580141 in 2.4 seconds, about 69 million base-case
recursions per second, or 14 nanoseconds each.

But I'm not feeling tempted by Java.  Sorry.

Squeak
------

I'm also a little tempted by Smalltalk.

The Smalltalk version is 

    fib: n
            n < 2 ifTrue: [^1] 
                  ifFalse: [^(self fib: n - 1) + (self fib: n - 2)]

In Squeak 3.8-6665full on an Intel VM, that takes 4410 ms to run 
fib: 35, which is 14930352; that's 0.295 μs per base-case recursion,
or 3.4 million per second, slightly slower than OCaml.  Which is
fairly impressive, considering that it's doing the same level of
dynamic dispatch as SBCL, but in a bytecode engine like OCaml's.

The bytecode looks like this:

    9 <10> pushTemp: 0
    10 <77> pushConstant: 2
    11 <B2> send: <
    12 <99> jumpFalse: 15
    13 <76> pushConstant: 1
    14 <7C> returnTop
    15 <70> self
    16 <10> pushTemp: 0
    17 <76> pushConstant: 1
    18 <B1> send: -
    19 <E0> send: fib:
    20 <70> self
    21 <10> pushTemp: 0
    22 <77> pushConstant: 2
    23 <B1> send: -
    24 <E0> send: fib:
    25 <B0> send: +
    26 <7C> returnTop
 
Or, as I read it in pseudo-FORTH, "n 2 < if 1 return then self n 1 -
recurse self n 2 - recurse + return".  The <E0> bytecode that calls
#fib: seems to be implemented by a method called
sendLiteralSelectorBytecode, which occupies all of the bytecodes <D0>
through <FF>, with the last four bits indexing into a "literal table"
associated with the method.  It's impressive to me that Squeak got
this into 18 bytes of bytecode (plus 8 bytes of other stuff),
considering that the original is 27 tokens.  (Subtracting all the
delimiters and considering #ifTrue:ifFalse: as one "token", we get
down to 18 source tokens.)
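
To make the packing concrete, here's how I understand those two
bytecode ranges decode, as a sketch in Python; the exact ranges are my
reading of the standard Smalltalk-80 bytecode map, so treat them as an
assumption:

    def decode(byte):
        if 0x10 <= byte <= 0x1F:           # pushTemp: 0 through pushTemp: 15
            return ('pushTemp:', byte & 0x0F)
        if 0xD0 <= byte <= 0xFF:           # send literal selector
            nargs = (byte - 0xD0) >> 4     # ... with 0, 1, or 2 arguments
            return ('send', byte & 0x0F, nargs)
        return ('other', byte)

    print decode(0xE0)   # ('send', 0, 1): send literal selector 0, one argument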

Python's bytecode
-----------------

Compare the Python bytecode, from dis.dis(fib.fib):

  3           0 LOAD_FAST                0 (n)
              3 LOAD_CONST               1 (2)
              6 COMPARE_OP               0 (<)
              9 JUMP_IF_FALSE            8 (to 20)
             12 POP_TOP             
             13 LOAD_CONST               2 (1)
             16 RETURN_VALUE        
             17 JUMP_FORWARD             1 (to 21)
        >>   20 POP_TOP             

  4     >>   21 LOAD_GLOBAL              1 (fib)
             24 LOAD_FAST                0 (n)
             27 LOAD_CONST               2 (1)
             30 BINARY_SUBTRACT     
             31 CALL_FUNCTION            1
             34 LOAD_GLOBAL              1 (fib)
             37 LOAD_FAST                0 (n)
             40 LOAD_CONST               1 (2)
             43 BINARY_SUBTRACT     
             44 CALL_FUNCTION            1
             47 BINARY_ADD          
             48 RETURN_VALUE        
             49 LOAD_CONST               0 (None)
             52 RETURN_VALUE        

Python's bytecode, very similar in design, uses 53 bytes for the same
method!

Much of the difference comes from Python using three bytes for
anything that takes a parameter, so it only uses a single bytecode
(called LOAD_FAST) for pushing parameters and the like, where Squeak
uses all the bytes from <10> to <1F> for pushing the first 16 local
variables; then there's <80> for up to 32 local variables.  I don't
know what it does when you have more than 32 local variables.

Similarly, CALL_FUNCTION takes two operand bytes to tell it how many
arguments the call has (one byte counts positional arguments, the
other keyword arguments), and additionally it's separated from the
LOAD_GLOBAL bytecode.
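
The one-byte/three-byte split is easy to see if you walk co_code by
hand; here's a sketch for Python 2:

    import dis

    def show(func):
        code = func.func_code.co_code
        i = 0
        while i < len(code):
            op = ord(code[i])
            if op >= dis.HAVE_ARGUMENT:   # three bytes: opcode, 16-bit operand
                arg = ord(code[i+1]) | (ord(code[i+2]) << 8)  # little-endian
                print i, dis.opname[op], arg
                i = i + 3
            else:                         # one byte: the opcode alone
                print i, dis.opname[op]
                i = i + 1

    show(fib)    # prints more or less what dis.dis printed above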

Then Python's JUMP_IF_FALSE doesn't pop the boolean value it uses, so
we need a separate POP_TOP bytecode on each branch; and there are dead
bytecodes at 17, 49, and 52, immediately after RETURN_VALUE
bytecodes.

The consequence of all of this is that Python currently uses only 109
of the 256 possible bytecodes, that Python bytecode is bloated and
slow, and that the Python compiler and bytecode interpreter are fairly
simple.
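
(You can check a count like that against the dis module's opcode
table, where unassigned opcodes get placeholder names of the form
'<n>':)

    import dis
    # defined opcodes have real names; unassigned slots look like '<n>'
    print len([name for name in dis.opname if not name.startswith('<')])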

Other Bytecodes
---------------

As far as I can tell, OCaml doesn't come with anything analogous to
SBCL's (disassemble 'fib), Python's dis module, or Squeak's browser
option "What to show: bytecodes", so all I know about the OCaml
bytecode is that the whole fib.cmo bytecode file produced by ocamlc
was 320 bytes, and that .cmo files seem to largely consist of 32-bit
big-endian binary two's-complement integers, with a big blob of stuff
at the end containing strings and other data.

I don't know anything about the implementation of the Ruby
interpreter.

perl -MO=Concise,fib fib.pl (using the B::Concise module) tells me the
following:

main::fib:
7  <1> leavesub[1 ref] K/REFC,1 ->(end)
-     <@> lineseq KP ->7
1        <;> nextstate(main 2 fib.pl:3) v/2 ->2
-        <1> null K/1 ->-
5           <|> cond_expr(other->6) K/1 ->8
4              <2> lt sK/2 ->5
-                 <1> ex-aelem sK/2 ->3
-                    <1> ex-rv2av sKR/3 ->-
2                       <#> aelemfast[*_] s ->3
-                    <0> ex-const s ->-
3                 <$> const[IV 2] s ->4
6              <$> const[IV 1] s ->7
k              <2> add[t10] sK/2 ->7
d                 <1> entersub[t5] sKS/TARG,3 ->e
-                    <1> ex-list sK ->d
8                       <0> pushmark s ->9
b                       <2> subtract[t4] sKM/2 ->c
-                          <1> ex-aelem sK/2 ->a
-                             <1> ex-rv2av sKR/3 ->-
9                                <#> aelemfast[*_] s ->a
-                             <0> ex-const s ->-
a                          <$> const[IV 1] s ->b
-                       <1> ex-rv2cv sK/3 ->-
c                          <#> gv[*fib] s/EARLYCV ->d
j                 <1> entersub[t9] sKS/TARG,3 ->k
-                    <1> ex-list sK ->j
e                       <0> pushmark s ->f
h                       <2> subtract[t8] sKM/2 ->i
-                          <1> ex-aelem sK/2 ->g
-                             <1> ex-rv2av sKR/3 ->-
f                                <#> aelemfast[*_] s ->g
-                             <0> ex-const s ->-
g                          <$> const[IV 2] s ->h
-                       <1> ex-rv2cv sK/3 ->-
i                          <#> gv[*fib] s/EARLYCV ->j

This is pretty confusing; it helps to know the following: operations
occur before and to the left of their operands; the two "entersub"
nodes are the recursive calls, and the const[IV 2] nodes are integer
constants such as 2; Perl's treecode runs with a stack as its working
memory; it builds lists on the stack PostScript-style, starting with a
"pushmark"; and every procedure call involves constructing a list of
parameters.

There's a '-exec' option to B::Concise that puts things in execution
order, which makes it look a lot like Python, but leaves out all the
ops labeled above with a "-".

Java uses yet another stack-based bytecode virtual machine.  javap -c
Fib says:

    public static int fib(int);
      Code:
       0:       iload_0
       1:       iconst_2
       2:       if_icmpge       7
       5:       iconst_1
       6:       ireturn
       7:       iload_0
       8:       iconst_1
       9:       isub
       10:      invokestatic    #2; //Method fib:(I)I
       13:      iload_0
       14:      iconst_2
       15:      isub
       16:      invokestatic    #2; //Method fib:(I)I
       19:      iadd
       20:      ireturn

I suspect that the numbers in the left column are byte offsets, which
would make this method 21 bytes, but I don't know enough about Java
bytecode to be sure.

Elisp compiles the same Lisp I fed to SBCL into the following
stack-based bytecode:

    byte code for fib:
      args: (n)
    0       varref    n
    1       constant  2
    2       lss       
    3       goto-if-nil 1
    6       constant  1
    7       return    
    8:1     constant  fib
    9       varref    n
    10      sub1      
    11      call      1
    12      constant  fib
    13      varref    n
    14      constant  2
    15      diff      
    16      call      1
    17      plus      
    18      return    

Or, more briefly, asciified with cat -vt:

    (defalias 'fib #[(n) 
        "[EMAIL PROTECTED]" [n 2 1 fib] 4])

That is 19 bytes of bytecode, although much of it is represented with
octal backslash-escapes.

The bytecode is not documented in section 16.8 of the Elisp Reference
Manual, "Disassembled Byte-Code".

Bytecode as an Implementation Strategy
--------------------------------------

Bytecode has several claimed advantages over dynamic compilation as a
strategy for implementing high-level languages: it's smaller than
machine code, interactive compilation is faster because the compiler
does less work, it's an architecture-neutral distribution format, and
porting to new platforms is easier.  There are, however, some
disadvantages.

* It's Smaller Than Machine Code

The compiled optimized SBCL version totaled 247 bytes of machine code.
The unoptimized version, with calls to GENERIC-< and the like, was 162
bytes.  And this is on the x86, which has historically been pretty
good for code density, although as you can see in the above, the
32-bit immediate constants diminish that advantage significantly; 60
of the 247 bytes are in 32-bit constants.  

The OCaml version is 24 instructions, 8 of which have immediate
constants.  I don't know very much about PowerPC assembly, but let's
suppose that every instruction is 32 bits, including any immediate
constants; that means the whole function weighs 96 bytes.

Compare this to the 26 bytes of the Squeak version, and you can see
that bytecode systems have a compelling advantage in situations where
code size is critical.  Even indirect-threaded code would probably
have been at least 32 bytes, assuming 16-bit addresses.

So at least this claimed advantage is well-founded.  If you write the
same program in the same way in SBCL and in Squeak, it is likely to
take considerably more code space in SBCL.

* Smaller Than How Much Machine Code?

Speaking of code compactness, though, for small programs, the
machine-code version of your program may be smaller than the bytecode
interpreter.  I tried to build the C version of Squeak with "VMMaker
new generateEntire" to see how big it was, but apparently there are
some .h files and things that live outside of the Squeak image; the
best I can say is that Interpreter.st is 337404 bytes and
ObjectMemory.st is 96646 bytes, for a total of 434050 bytes, about
7000-10000 lines of Smalltalk, gzipping to 98274 bytes.  I am guessing
that the machine-code version would be around 100KiB.

Presumably you can build an almost arbitrarily-small bytecode
interpreter (say, one that implements just the S and K combinators)
but the question is how big the bytecode interpreter has to be to make
the bytecode version of your program smaller than the corresponding
machine-code version, while still running at an acceptable speed.
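
Just to calibrate "arbitrarily small": here's a complete reducer for
the S-K combinator calculus in a dozen-odd lines of Python.  It walks
a tree rather than a byte stream, so it's a toy rather than a bytecode
interpreter, but it shows how little machinery universality demands:

    # A term is 'S', 'K', some other atom, or a pair (f, x): "apply f to x".
    def reduce_sk(term):
        while True:                              # reduce head redexes only
            f, args = term, []
            while isinstance(f, tuple):          # unwind the application spine
                f, a = f
                args.append(a)
            args.reverse()
            if f == 'K' and len(args) >= 2:      # K a b   -> a
                term = rebuild(args[0], args[2:])
            elif f == 'S' and len(args) >= 3:    # S f g x -> f x (g x)
                term = rebuild(((args[0], args[2]), (args[1], args[2])),
                               args[3:])
            else:
                return term                      # no redex at the head

    def rebuild(head, rest):
        for a in rest:
            head = (head, a)
        return head

    print reduce_sk(((('S', 'K'), 'K'), 'x'))    # S K K x -> x: prints x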

Chuck Moore's Novix NC-4000 and MuP21 work suggests that a useful
bytecode interpreter could be very small indeed; the MuP21 is a
6000-transistor chip in which the CPU (there's also some other stuff
on the chip) executes a stream of 5-bit zero-operand two-stack
operations packed into 20-bit words.  (I forget how LITERAL, aka
load-immediate, is implemented.)  The code for the machine is claimed
to be quite compact.  Unfortunately, the follow-on chips (the F21, the
iTV i21, and the 25x) have not been successfully fabricated.

* Interactive Compilation is Faster

This advantage is probably irrelevant now, except for very small
machines, say a 70MHz LPC2103 microcontroller.  The SBCL compiler
compiled the unoptimized Lisp version in 12ms on a 1.8GHz dual-core
CPU; a cleverer implementation, like Apple ][ Integer Basic, could
parse and compile as you typed, keeping pauses to a minimum.

* Architecture-Neutral Distribution Format

That is to say, you don't lose your software when you change CPU
architectures.

Source code is also an architecture-neutral distribution format; I am
going to ignore the political reasons for not distributing source code
and focus on the technical reason, which is that bytecode is smaller.
In effect, the claimed advantage is that bytecode is a good
compression algorithm for source code.

If this is the case, we should perhaps compare it against generic
compression algorithms like gzip.  Here are some results from gzip -9:

file                 size  gzipped  ratio  language
bicicleta_lexer.mll  2380      965    2.5  ocamllex
readbmp.py           5022     1910    2.6  Python
js-calc.js          15520     5767    2.7  JavaScript
webrick/cgi.rb       6779     2251    3.0  Ruby
cursmail.py         11627     3938    3.0  Python
bicicleta.ml        13426     3461    3.9  OCaml

I think these are fairly typical compression ratios for gzip on source
code.
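
(These numbers are easy to reproduce; zlib does the same compression
as gzip, minus a few bytes of header.  A sketch, with one of the
filenames above standing in:)

    import zlib
    data = open('bicicleta.ml').read()
    zipped = zlib.compress(data, 9)              # level 9, like gzip -9
    print len(data), len(zipped), round(float(len(data)) / len(zipped), 1)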

The Perl source code version of the fib function is 60 bytes; OCaml is
61, Ruby is 62, unoptimized Lisp is 63, Python is 66, Squeak is 75 if
you put it all on one line, Java is 87 if you put it all on one line,
and Tcl is 108.  So if 2.7 is a typical compression ratio and 63 bytes
is a typical size for implementations of this function, we'd expect
gzip to produce a result of about 23.3 bytes; if you apply the 2.7
ratio to the Smalltalk function, you get 27.8 bytes.  These compare
favorably with the Squeak bytecode version at 26 bytes.

For larger functions, bytecode seems to work better as compression.
Here's SkipList>>add:ifPresent:, which was the first largish method I
found while looking around Squeak at random:

    add: element ifPresent: aBlock
            | node lvl s |
            node := self search: element updating: splice.
            node ifNotNil: [aBlock ifNotNil: [^ aBlock value: node]].
            lvl := self randomLevel.
            node := SkipListNode on: element level: lvl.
            level + 1 to: lvl do: [:i | splice at: i put: self].
            1 to: lvl do: [:i |
                                    s := splice at: i.
                                    node atForward: i put: (s forward: i).
                                    s atForward: i put: node].
            numElements := numElements + 1.
            splice atAllPut: nil.
            ^ element

That's 619 bytes of source if tabified, or 465 bytes if untabified.
The bytecode representation is 118 bytes, including 32 bytes of
non-bytecode (literal and constant tables, I guess?) and 86 bytes of
bytecode.  That's a compression ratio of 3.9 to 5.2, which is
substantially better than gzip would do.

(Maybe I should try 7-zip, but it's not installed on this Mac.  bzip2
does worse than gzip on such small files.)

The names of the method selectors and classes referenced herein are
not included in those 32 bytes --- rather, there are seven four-byte
pointers: #search:updating:, #randomLevel, #on:level:, SkipListNode,
#atForward:put:, #forward:, and #atAllPut:.  (The other method
selectors, like #+, #at:, #value: and #at:put:, have special bytecodes
for them.)  These 69 bytes have to be included somewhere in a bytecode
file intended for use in an architecture-neutral distribution format
that supports dynamic linking, but they only have to be included once,
and if you don't need to link dynamically or modify the software after
delivery, they don't need to be included at all.

So it looks like bytecode is a somewhat more compact architecture-
neutral distribution format than gzipped source code; the latter has
some additional technical advantages, such as being easier to modify,
being easier to optimize, containing comments, and not requiring a
special-purpose compressor.

However, this advantage is less compelling than the code-size
advantage; while Squeak needed a factor of 3 to 10 less code space
than native-code compilation systems, it looks like bytecode uses more
like a factor of 1 or 2 less space than gzipped source code --- in a
couple of trivial examples, which might not be completely realistic.

* Porting to new platforms is easier

I'm not sure if this is true, but it seems plausible: a native-code
generator is more tightly coupled to the CPU architecture than a
bytecode interpreter, and if the bytecode interpreter is written in a
language available on the new target platform, you may be able to port
it with a simple recompile.  This may seem irrelevant today, when all
"personal computers" and most servers use the Intel 80386 instruction
set, but there is a lot of uncertainty on the horizon:

- ARM is still very popular on mobile phones, which are computers, are
  personal, and outnumber "PCs" perhaps ten to one;
- AVR and PIC are very popular in the deep-embedded space;
- the smoothness of Apple's recent move to Intel processors
  demonstrates that the barrier to entry for new CPU architectures is
  at an all-time low;
- most networked appliances (from companies like Cisco, D-Link, Axis,
  and Broadcom) use non-386 CPUs that run Linux, and there are a huge
  number of them;
- the Cell chip in the Playstation 3 has two different kinds of
  non-386 CPUs, and like the above, game consoles outnumber "PCs", but
  unlike the above, have enormous computational power;
- after the coming transition to highly concurrent software takes
  place, the metric by which CPUs are valued may change from serial
  processing throughput to aggregate processing throughput, which may
  create an opportunity for disruptive CPU innovation.

But, for the moment, portability matters much less than it used to.

* Other advantages?

It could be argued that bytecode compiler/interpreters are simpler
than dynamic native-code compilers.  I don't have enough experience to
know whether this is true in general, but I'm pretty sure that if
you're running on top of an environment like SBCL, it isn't.  Even
spawning an instance of gcc and dlopening the resulting object is
likely to be acceptably fast, and then you only have to generate C
code.  (You have to be careful not to #include any large .h files,
though.)

Parse-tree walkers, such as the Perl interpreter and the kind of Lisp
interpreter found in books like SICP and EOPL, are simpler than either
one, but they tend to be slower too.
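
For flavor, here's roughly the smallest such tree walker that can run
the fib benchmark, sketched in Python with nested tuples standing in
for the parse tree (the node shapes here are my invention, not any
particular book's):

    import operator
    GLOBALS = {'<': operator.lt, '+': operator.add, '-': operator.sub}

    def ev(x, env):
        if isinstance(x, str):                   # variable reference
            if x in env: return env[x]
            return GLOBALS[x]
        if isinstance(x, int): return x          # numeric literal
        if x[0] == 'if':                         # (if test then else)
            if ev(x[1], env): return ev(x[2], env)
            return ev(x[3], env)
        if x[0] == 'lambda':                     # (lambda params body)
            return lambda *args: ev(x[2], dict(env, **dict(zip(x[1], args))))
        f = ev(x[0], env)                        # application
        return f(*[ev(arg, env) for arg in x[1:]])

    GLOBALS['fib'] = ev(('lambda', ('n',),
                         ('if', ('<', 'n', 2), 1,
                          ('+', ('fib', ('-', 'n', 1)),
                                ('fib', ('-', 'n', 2))))), {})
    print GLOBALS['fib'](20)                     # prints 10946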

* Disadvantages

Bytecode interpreters are usually pretty slow; the bytecode
interpreter used by Squeak probably accounts for most of the
difference between it (295ns) and unoptimized SBCL (52ns).

JIT bytecode interpreters, such as recent Java virtual machines, can
be less slow, but they clearly lose the code-size advantage normally
enjoyed by bytecode interpreters, the postulated simplicity advantage,
and probably the compilation-time advantage.

* Conclusion

Bytecode interpreters can offer unsurpassed code compactness, although
Python's and OCaml's bytecode formats completely fail to do so.  This
matters a lot on microcontrollers, but I don't know whether it matters
a lot or a little on modern personal computer CPUs --- this Mac has
2GiB of RAM, but only 2MiB of L2 cache, and presumably something like
16-64KiB of L1 cache.  Thrashing the cache is soundly punished.

They also offer a size advantage over gzipped source as a software
distribution format, but this advantage is less compelling.  This
suggests that they ought to be popular in the demoscene, where there's
apparently a lot of action in the "coolest audiovisual demo under N
bytes" categories these days, for values of N such as 128, 256, 4096,
and 65536; if they aren't, it casts doubt on this proposed advantage.

I know of no other reason to implement an interpreter using bytecode.

So I'm surprised it's such a popular thing to do!  I think the reason
is probably that code space and compilation time used to be quite
precious resources (not to mention portability), and programmers just
haven't adjusted to the new realities.
