So I've been prototyping a Bicicleta interpreter in OCaml, and generally the experience has been great. I'd never used lex or yacc, but I know most of the theory, so given the reference-manual documentation for ocamllex and ocamlyacc, I was able to put together useful parsers pretty quickly.
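To give the flavor --- this toy example is mine, not Bicicleta's actual grammar --- an ocamllex/ocamlyacc parser is just a pair of short files like these, which parse sums of integers:

    (* toy_lexer.mll: a hypothetical ocamllex lexer for integer sums.
       Toy_parser is the module ocamlyacc generates from toy_parser.mly. *)
    {
    open Toy_parser
    }
    rule token = parse
      | [' ' '\t' '\n']  { token lexbuf }           (* skip whitespace *)
      | ['0'-'9']+ as n  { INT (int_of_string n) }  (* integer literal *)
      | '+'              { PLUS }
      | eof              { EOF }

    /* toy_parser.mly: the matching ocamlyacc grammar */
    %token <int> INT
    %token PLUS EOF
    %start main
    %type <int> main
    %%
    main: expr EOF        { $1 }      ;
    expr: INT             { $1 }
        | expr PLUS INT   { $1 + $3 } ;

ocamllex turns the first into a table-driven lexer module, ocamlyacc turns the second into an LALR(1) parser module, and a one-line driver like Toy_parser.main Toy_lexer.token (Lexing.from_channel stdin) ties them together.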
And OCaml is nice and fast. For most things, fast doesn't matter all that much any more; but a language interpreter, by its nature, imposes some significant slowdown on the language it's interpreting, usually from one to three orders of magnitude. So implementing your language interpreter in, say, Python may not be a good idea --- the two orders of magnitude of slowdown imposed by Python are two orders of magnitude that get imposed on every program running in your interpreter, on top of the one to three your interpreter imposes. Multiply it all out, and you can end up iterating through loops thousands of times per second, instead of hundreds of millions.

OCaml really seems to be optimized for language implementations: pattern-matching, the facility it calls "variants", ocamllex and ocamlyacc, garbage collection, and static type-checking all make a lot of sense for compilers. And you can compile and deliver a binary. But I'm being tempted by SBCL!

The Microbenchmark (OCaml, Python, Perl, Ruby, Elisp, Tcl)
----------------------------------------------------------

There's a microbenchmark I like to run on language implementations; here's the OCaml version:

    let rec fib n = if n < 2 then 1 else fib (n-1) + fib (n-2) in
    print_int (fib 32) ; print_newline() ;;

This computes a Fibonacci number in a pretty slow way, recursing into the base case N times in order to compute the number N. On Bea's laptop, one of the new Intel MacBooks, it runs at about 3.1 million base-case recursions per second in the OCaml interpreter, or 5.2 million per second if I ocamlc it first. This may be a little unfair to OCaml, since I'm running a PowerPC version on this machine, through transparent binary translation. But the Python version runs at 0.73 million base-case recursions per second:

    def fib(n):
        if n < 2: return 1
        return fib(n-1) + fib(n-2)
    print fib(32)

That's Python 2.3, compiled for Intel. The Perl version runs at the same speed:

    #!/usr/bin/perl -w
    use strict;
    sub fib { $_[0] < 2 ? 1 : fib($_[0] - 1) + fib($_[0] - 2) }
    print fib(32), "\n";

So does Ruby (well, actually, 0.75 million):

    #!/usr/bin/ruby
    def fib(n)
      if n < 2 then 1 else fib(n-1) + fib(n-2) end
    end
    print fib(32), "\n"

Byte-compiled elisp is a bit slower, at 0.46 million per second, timed with this:

    (let ((start (current-time)))
      (setq fib30 (fib 30))
      (time-since start))

(See below in the SBCL section for the Lisp code for fib.) Tcl is even worse, at 0.083 million per second:

    proc fib {x} {
        if {$x < 2} { return 1 } {
            return [expr [fib [expr $x - 1]] + [fib [expr $x - 2]]]
        }
    }
    puts [fib 28]

So OCaml really isn't doing that badly, especially considering it's running on an emulated CPU. I probably should have included JavaScript here (since it is, after all, The Next Mainstream Programming Language), but I didn't. My experience is that, in both Firefox 1.5 and Safari, it's usually in about the same speed range as Python, Perl, Ruby, and Elisp.

SBCL
----

But then I tried it in (Intel) SBCL:

    (defun fib (n)
      (if (< n 2)
          1
          (+ (fib (- n 1))
             (fib (- n 2)))))
    (print (fib 43))

And it ran, in the interactive interpreter, at 19 million per second, four times as fast as OCaml. SBCL, it turns out, immediately compiles everything to native code, even in the interactive interpreter. Compiling the above function takes 12 milliseconds. SBCL also lets you do (disassemble 'fib), which I did, and I saw that this version was still calling out to do generic arithmetic and comparison operations, even though SBCL's compiler does some type inference.
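What does "generic arithmetic" mean concretely? Roughly this --- a sketch in OCaml, with names of my own invention; real Lisps use tag bits in the machine word rather than a variant type, but the branchy shape is the same:

    (* A sketch (names mine) of the dispatch "generic arithmetic"
       implies: every + and < first inspects the runtime type of
       both operands. *)
    type value =
      | Fixnum of int        (* small integer that fits in a machine word *)
      | Bignum of string     (* stand-in for an arbitrary-precision integer *)

    let generic_add a b =
      match a, b with
      | Fixnum x, Fixnum y -> Fixnum (x + y)  (* plus an overflow check,
                                                 falling back to Bignum *)
      | _, _ -> failwith "bignum addition omitted from this sketch"

    let generic_lt a b =
      match a, b with
      | Fixnum x, Fixnum y -> x < y
      | _, _ -> failwith "bignum comparison omitted from this sketch"

Doing that test-and-branch on every single + and < is a large part of why dynamically-typed code runs slower than it needs to when the compiler can't prove the types.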
So I whacked at the Lisp a bit until it didn't call out to generic arithmetic any more:

    (defun fib (n)
      (if (< (the fixnum n) 2)
          1
          (+ (the fixnum (fib (- n 1)))
             (the fixnum (fib (- n 2))))))
    (print (fib 43))

This doesn't omit the type-checking entirely, but it omits the dynamic dispatch of the numeric operations, and consequently it ran more than twice as fast, at 47 million base-case recursions per second, nine times as fast as OCaml. I haven't figured out how to turn off the run-time type-checking on each call entirely, which I think might trim it from 91 instructions to 20 or so. The main body of those instructions, with added comments, is as follows:

    115B0F9A: 8B55F0       MOV EDX, [EBP-16]      ; no-arg-parsing entry point
        0F9D: F6C203       TEST DL, 3             ; examine low two bits (tag)
        0FA0: 0F858E000000 JNE L4                 ; should be 00 for fixnum
        0FA6: 83FA08       CMP EDX, 8             ; 8 is fixnum 2 (<<2)
        0FA9: 7C70         JL L3                  ; so if it's less, we go to L3
    ; This next instruction is unnecessary because none of the previous
    ; four instructions modify the registers!
        0FAB: 8B55F0       MOV EDX, [EBP-16]      ; fetch the argument again
        0FAE: 83EA04       SUB EDX, 4             ; subtract fixnum 1
    ; Here's the call sequence for FIB:
        0FB1: 8BDC         MOV EBX, ESP           ; save old stack pointer
        0FB3: 83EC0C       SUB ESP, 12            ; allocate three words
    ; Load function pointer from #<FDEFINITION object for FIB>:
        0FB6: 8B05340F5B11 MOV EAX, [#x115B0F34]
        0FBC: B904000000   MOV ECX, 4             ; load fixnum 1
        0FC1: 896BFC       MOV [EBX-4], EBP       ; save old base pointer
        0FC4: 8BEB         MOV EBP, EBX           ; stick old stack pointer in EBP
        0FC6: FF5005       CALL DWORD PTR [EAX+5] ; recursive call
        0FC9: 7302         JNB L0                 ; skip next instruction ...?
        0FCB: 8BE3         MOV ESP, EBX           ; restore old stack pointer
        0FCD: L0: F6C203   TEST DL, 3
        0FD0: 7568         JNE L5                 ; type-tag test on returned value
        0FD2: 8955F4       MOV [EBP-12], EDX      ; save returned value
        0FD5: 8B55F0       MOV EDX, [EBP-16]      ; fetch the argument again
        0FD8: 83EA08       SUB EDX, 8             ; subtract fixnum 2
    ; Same call/return sequence again:
        0FDB: 8BDC         MOV EBX, ESP
        0FDD: 83EC0C       SUB ESP, 12
        0FE0: 8B05340F5B11 MOV EAX, [#x115B0F34]  ; #<FDEFINITION object for FIB>
        0FE6: B904000000   MOV ECX, 4
        0FEB: 896BFC       MOV [EBX-4], EBP
        0FEE: 8BEB         MOV EBP, EBX
        0FF0: FF5005       CALL DWORD PTR [EAX+5]
        0FF3: 7302         JNB L1
        0FF5: 8BE3         MOV ESP, EBX
    ; End of call-return sequence.  Whew!
        0FF7: L1: F6C203   TEST DL, 3             ; type-tag test, again
        0FFA: 7544         JNE L6
        0FFC: 8B45F4       MOV EAX, [EBP-12]      ; load other returned value
    ; This next sequence does the fixnum arithmetic in full 32-bit
    ; registers instead of the 30-bit fixnum form for some reason.
        0FFF: C1F802       SAR EAX, 2             ; shift arithmetic right 2 bits
        1002: C1FA02       SAR EDX, 2             ; on both
        1005: 01D0         ADD EAX, EDX           ; DO THE ADDITION!
        1007: 8BD0         MOV EDX, EAX           ; make useless copy of result
        1009: D1E2         SHL EDX, 1             ; shift left 1 bit
        100B: 7039         JO L7                  ; jump if overflow
        100D: D1E2         SHL EDX, 1
        100F: 7035         JO L7
    ; End of addition sequence.  For positive fixnums, this would have
    ; been better and exactly equivalent (except that it doesn't
    ; clobber EAX):
    ;     ADD EDX, EAX
    ;     JC L7
    ; but I think that won't work for negative fixnums.
    ; At this point, we have the return value in EDX, and we're ready
    ; to return.
        1011: L2: 8D65F8   LEA ESP, [EBP-8]       ; pop stack frame
        1014: F8           CLC                    ; clear carry flag (for JNB?)
        1015: 8B6DFC       MOV EBP, [EBP-4]       ; restore old frame pointer?
        1018: C20400       RET 4
        101B: L3: BA04000000 MOV EDX, 4           ; fixnum 1 if n<2
        1020: EBEF         JMP L2                 ; ret val fixnum 1 is in EDX...
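The tag tests and shifts make more sense with the value representation in mind: SBCL stores a fixnum shifted left two bits, with 00 in the low (tag) bits. Here is a small OCaml model of that representation --- the names are mine, and only the fixnum path is modeled:

    (* An OCaml model (names mine) of the fixnum representation the
       disassembly above manipulates: a word whose low two (tag) bits
       are 00 is an integer stored shifted left by two. *)
    let tag_fixnum n = n lsl 2        (* fixnum 1 is 4, fixnum 2 is 8 *)
    let is_fixnum v = v land 3 = 0    (* the TEST DL, 3 instructions *)
    let untag v = v asr 2             (* the SAR EAX/EDX, 2 instructions *)

    (* The addition sequence: untag, add, retag -- where retagging is
       what can overflow, hence the two SHL/JO pairs jumping to the
       bignum handler at L7. *)
    let fixnum_add v w =
      if not (is_fixnum v && is_fixnum w) then failwith "not a fixnum";
      let sum = untag v + untag w in
      if sum > max_int asr 2 || sum < min_int asr 2
      then failwith "sum needs a bignum"   (* the L7 case *)
      else tag_fixnum sum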
There are another 44 instructions having to do mostly with exception and error handling: if the argument n wasn't a fixnum (L4), if the first recursion returned a non-fixnum (L5), if the second recursion returned a non-fixnum (L6), or if the addition overflowed to a bignum (L7); and there are some NOPs and dead code for handling invalid arg counts.

Presumably OCaml programs would run at that speed too, or faster, if I compiled them to native code, but the native-code compiler needs an assembler, and installing an assembler under MacOS X apparently requires 3+ gigabytes of disk space. ocamlopt, for its part, compiles the fib routine in this microbenchmark to 24 PowerPC instructions, which, if I had an assembler, I could benchmark. (It might be more to the point to test the Intel OCaml.) Here's the ocamlopt PowerPC assembly; notice that it is much simpler:

    _camlFib__fib_57:
            mflr    r0
            addi    r1, r1, -16
            stw     r0, 12(r1)
    L101:
            cmpwi   r3, 5
            bge     L100
            lwz     r11, 12(r1)
            mtlr    r11
            li      r3, 3
            addi    r1, r1, 16
            blr
    L100:
            stw     r3, 0(r1)
            addi    r3, r3, -4
    L102:
            bl      _camlFib__fib_57
            lwz     r4, 0(r1)
            stw     r3, 4(r1)
            addi    r3, r4, -2
    L103:
            bl      _camlFib__fib_57
            lwz     r11, 12(r1)
            mtlr    r11
            lwz     r4, 4(r1)
            add     r7, r3, r4
            addi    r3, r7, -1
            addi    r1, r1, 16
            blr

And I got to thinking.

Dynamic Code Generation
-----------------------

If I write an implementation in OCaml, I get all of OCaml's shiny machinery for interpreting or compiling, but as far as I can tell, there is no "eval". There's the Dynlink module, which lets bytecode programs dynamically load bytecode files, so if I compiled stuff into OCaml, I could dynamically load the result (if I forgo native-code compilation); but the OCaml compiler isn't actually all that fast, so it might cause long pauses.

But if I write an implementation in SBCL, I can dynamically compile code into a running SBCL process very easily, with two lines of code:

    This is SBCL 0.9.15, an implementation of ANSI Common Lisp.
    ...
    * (defun return-const-fun (name const) `(defun ,name () ,const))

    RETURN-CONST-FUN
    * (eval (return-const-fun 'foobar 40))

    FOOBAR
    * (disassemble 'foobar)

    ; 11608A1A: BAA0000000       MOV EDX, 160    ; no-arg-parsing entry point
    ;       1F: 8D65F8           LEA ESP, [EBP-8]
    ;       22: F8               CLC
    ;       23: 8B6DFC           MOV EBP, [EBP-4]
    ;       26: C20400           RET 4
    ... a bunch of nops and boilerplate omitted ...

The first line defines a function that generates code; the second one invokes it and compiles the result into, as it turns out, five machine instructions. (160 is 40, shifted left by two bits.) And I have the impression that generating code in Common Lisp will be easier than generating code in OCaml, C, or assembler.

Java
----

Oh, yeah, Java is even faster:

    class Fib {
        public static int fib(int n) {
            if (n < 2) return 1;
            return fib(n-1) + fib(n-2);
        }
        public static void main(String[] argv) {
            System.out.println(new Integer(fib(40)).toString());
        }
    }

That returns 165580141 in 2.4 seconds, about 69 million base-case recursions per second, or 14 nanoseconds each. But I'm not feeling tempted by Java. Sorry.

Squeak
------

I'm also a little tempted by Smalltalk. The Smalltalk version is

    fib: n
        n < 2
            ifTrue: [^1]
            ifFalse: [^(self fib: n - 1) + (self fib: n - 2)]

In Squeak 3.8-6665full on an Intel VM, that takes 4410 ms to run fib: 35, which is 14930352; that's 0.295 μs per base-case recursion, or 3.4 million per second, slightly slower than OCaml. Which is fairly impressive, considering that it's doing the same level of dynamic dispatch as SBCL, but in a bytecode engine like OCaml's.
The bytecode looks like this:

     9 <10> pushTemp: 0
    10 <77> pushConstant: 2
    11 <B2> send: <
    12 <99> jumpFalse: 15
    13 <76> pushConstant: 1
    14 <7C> returnTop
    15 <70> self
    16 <10> pushTemp: 0
    17 <76> pushConstant: 1
    18 <B1> send: -
    19 <E0> send: fib:
    20 <70> self
    21 <10> pushTemp: 0
    22 <77> pushConstant: 2
    23 <B1> send: -
    24 <E0> send: fib:
    25 <B0> send: +
    26 <7C> returnTop

Or, as I read it in pseudo-FORTH, "n 2 < if 1 return then self n 1 - recurse self n 2 - recurse + return". The <E0> bytecode that calls #fib: seems to be implemented by a method called sendLiteralSelectorBytecode, which occupies all of the bytecodes <D0> through <FF>, with the last four bits indexing into a "literal table" associated with the method.

It's impressive to me that Squeak got this into 18 bytes of bytecode (plus 8 bytes of other stuff), considering that the original is 27 tokens. (Subtracting all the delimiters and counting #ifTrue:ifFalse: as one "token", we get down to 18 source tokens.)

Python's bytecode
-----------------

Compare the Python bytecode, from dis.dis(fib.fib):

      3           0 LOAD_FAST                0 (n)
                  3 LOAD_CONST               1 (2)
                  6 COMPARE_OP               0 (<)
                  9 JUMP_IF_FALSE            8 (to 20)
                 12 POP_TOP
                 13 LOAD_CONST               2 (1)
                 16 RETURN_VALUE
                 17 JUMP_FORWARD             1 (to 21)
            >>   20 POP_TOP

      4     >>   21 LOAD_GLOBAL              1 (fib)
                 24 LOAD_FAST                0 (n)
                 27 LOAD_CONST               2 (1)
                 30 BINARY_SUBTRACT
                 31 CALL_FUNCTION            1
                 34 LOAD_GLOBAL              1 (fib)
                 37 LOAD_FAST                0 (n)
                 40 LOAD_CONST               1 (2)
                 43 BINARY_SUBTRACT
                 44 CALL_FUNCTION            1
                 47 BINARY_ADD
                 48 RETURN_VALUE
                 49 LOAD_CONST               0 (None)
                 52 RETURN_VALUE

Python's bytecode, very similar in design, uses 53 bytes for the same method! Much of the difference comes from Python using three bytes for anything that takes a parameter; that way it needs only a single bytecode (called LOAD_FAST) for pushing any parameter or local variable, where Squeak uses all the bytes from <10> to <1F> for pushing the first 16 local variables, plus <80> for up to 32 local variables. (I don't know what it does when you have more than 32 local variables.) Similarly, CALL_FUNCTION takes two bytes to tell it how many arguments the function has, I guess in case you have a function with 16383 arguments, and additionally it's separate from the LOAD_GLOBAL bytecode. Then Python's JUMP_IF_FALSE doesn't pop the boolean value it uses, so we need a separate POP_TOP bytecode for that; and there are dead bytecodes at 17 (a JUMP_FORWARD immediately after a RETURN_VALUE) and at 49 and 52 (an unreachable LOAD_CONST/RETURN_VALUE pair after the final RETURN_VALUE). The consequence of all this is that Python currently uses only 109 of the 256 possible bytecodes, that Python bytecode is bloated and slow, and that the Python compiler and bytecode interpreter are fairly simple.

Other Bytecodes
---------------

As far as I can tell, OCaml doesn't come with anything analogous to SBCL's (disassemble 'fib), Python's dis module, or Squeak's browser option "What to show: bytecodes", so all I know about the OCaml bytecode is that the whole fib.cmo bytecode file produced by ocamlc was 320 bytes, and that .cmo files seem to largely consist of 32-bit big-endian binary two's-complement integers, with a big blob of stuff at the end containing strings and other data. I don't know anything about the implementation of the Ruby interpreter.
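(Lacking any disassembler for .cmo files, about the best one can do is dump them as words and stare. Here's a sketch of such a dumper --- my own throwaway code, nothing that ships with OCaml:)

    (* dumpwords.ml: print a file as 32-bit big-endian words, for
       staring at .cmo files.  Assumes a 64-bit ocaml; on a 32-bit
       one the words should be assembled as Int32s instead. *)
    let () =
      let ic = open_in_bin Sys.argv.(1) in
      (try
         while true do
           let b1 = input_byte ic in   (* most significant byte first *)
           let b2 = input_byte ic in
           let b3 = input_byte ic in
           let b4 = input_byte ic in
           Printf.printf "%08x\n"
             ((b1 lsl 24) lor (b2 lsl 16) lor (b3 lsl 8) lor b4)
         done
       with End_of_file -> close_in ic)

Running "ocaml dumpwords.ml fib.cmo" prints one word per line; any trailing bytes that don't fill a whole word are dropped, which is good enough for eyeballing.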
perl -MO=Concise,fib fib.pl (using the B::Concise module) tells me the following:

    main::fib:
    7  <1> leavesub[1 ref] K/REFC,1 ->(end)
    -     <@> lineseq KP ->7
    1        <;> nextstate(main 2 fib.pl:3) v/2 ->2
    -        <1> null K/1 ->-
    5           <|> cond_expr(other->6) K/1 ->8
    4              <2> lt sK/2 ->5
    -                 <1> ex-aelem sK/2 ->3
    -                    <1> ex-rv2av sKR/3 ->-
    2                       <#> aelemfast[*_] s ->3
    -                    <0> ex-const s ->-
    3                 <$> const[IV 2] s ->4
    6              <$> const[IV 1] s ->7
    k              <2> add[t10] sK/2 ->7
    d                 <1> entersub[t5] sKS/TARG,3 ->e
    -                    <1> ex-list sK ->d
    8                       <0> pushmark s ->9
    b                       <2> subtract[t4] sKM/2 ->c
    -                          <1> ex-aelem sK/2 ->a
    -                             <1> ex-rv2av sKR/3 ->-
    9                                <#> aelemfast[*_] s ->a
    -                             <0> ex-const s ->-
    a                          <$> const[IV 1] s ->b
    -                       <1> ex-rv2cv sK/3 ->-
    c                          <#> gv[*fib] s/EARLYCV ->d
    j                 <1> entersub[t9] sKS/TARG,3 ->k
    -                    <1> ex-list sK ->j
    e                       <0> pushmark s ->f
    h                       <2> subtract[t8] sKM/2 ->i
    -                          <1> ex-aelem sK/2 ->g
    -                             <1> ex-rv2av sKR/3 ->-
    f                                <#> aelemfast[*_] s ->g
    -                             <0> ex-const s ->-
    g                          <$> const[IV 2] s ->h
    -                       <1> ex-rv2cv sK/3 ->-
    i                          <#> gv[*fib] s/EARLYCV ->j

This is pretty confusing; it helps to know the following: operations occur before and to the left of their operands; the two "entersub" nodes are the recursive calls, and the const[IV 2] nodes are integer constants such as 2; Perl's treecode runs with a stack as its working memory; it builds lists on the stack PostScript-style, starting with a "pushmark"; and every procedure call involves constructing a list of parameters. There's a '-exec' option to B::Concise that puts things in execution order, which makes it look a lot like Python, but leaves out all the ops labeled above with a "-".

Java is yet another stack-based bytecode virtual machine. javap -c Fib says:

    public static int fib(int);
      Code:
       0:   iload_0
       1:   iconst_2
       2:   if_icmpge       7
       5:   iconst_1
       6:   ireturn
       7:   iload_0
       8:   iconst_1
       9:   isub
       10:  invokestatic    #2; //Method fib:(I)I
       13:  iload_0
       14:  iconst_2
       15:  isub
       16:  invokestatic    #2; //Method fib:(I)I
       19:  iadd
       20:  ireturn

I suspect that the numbers in the left column are byte offsets, which would make this method 21 bytes, but I don't know enough about Java bytecode to be sure.

Elisp compiles the same Lisp I fed to SBCL into the following stack-based bytecode:

    byte code for fib:
      args: (n)
    0       varref    n
    1       constant  2
    2       lss
    3       goto-if-nil 1
    6       constant  1
    7       return
    8:1     constant  fib
    9       varref    n
    10      sub1
    11      call      1
    12      constant  fib
    13      varref    n
    14      constant  2
    15      diff
    16      call      1
    17      plus
    18      return

Or, more briefly, asciified with cat -vt:

    (defalias 'fib #[(n) "[EMAIL PROTECTED]" [n 2 1 fib] 4])

That is 19 bytes of bytecode, although much of it is represented with octal backslash-escapes. The bytecode is not documented in section 16.8 of the Elisp Reference Manual, "Disassembled Byte-Code".

Bytecode as an Implementation Strategy
--------------------------------------

Bytecode has several claimed advantages over dynamic compilation as a strategy for implementing high-level languages: it's smaller than machine code, interactive compilation is faster because the compiler does less work, it's an architecture-neutral distribution format, and porting to new platforms is easier. There are, however, some disadvantages.

* It's Smaller Than Machine Code

The compiled optimized SBCL version totaled 247 bytes of machine code. The unoptimized version, with calls to GENERIC-< and the like, was 162 bytes. And this is on the x86, which has historically been pretty good for code density, although as you can see in the above, the 32-bit immediate constants diminish that advantage significantly; 60 of the 247 bytes are in 32-bit constants.
The OCaml version is 24 instructions, 8 of which have immediate constants. I don't know very much about PowerPC assembly, but let's suppose that every instruction is 32 bits, including any immediate constants; that means the whole function weighs 96 bytes. Compare this to the 26 bytes of the Squeak version, and you can see that bytecode systems have a compelling advantage in situations where code size is critical. Even indirect-threaded code would probably have been at least 32 bytes, assuming 16-bit addresses. So at least this claimed advantage is well-founded: if you write the same program in the same way in SBCL and in Squeak, it is likely to take considerably more code space in SBCL.

* Smaller Than How Much Machine Code?

Speaking of code compactness, though: for small programs, the machine-code version of your program may be smaller than the bytecode interpreter. I tried to build the C version of Squeak with "VMMaker new generateEntire" to see how big it was, but apparently there are some .h files and things that live outside of the Squeak image; the best I can say is that Interpreter.st is 337404 bytes and ObjectMemory.st is 96646 bytes, for a total of 434050 bytes, about 7000-10000 lines of Smalltalk, gzipping to 98274 bytes. I am guessing that the machine-code version would be around 100KiB.

Presumably you can build an almost arbitrarily small bytecode interpreter (say, one that implements just the S and K combinators), but the question is how big the bytecode interpreter has to be to make the bytecode version of your program smaller than the corresponding machine-code version, while still running at an acceptable speed. Chuck Moore's Novix NC-4000 and MuP21 work suggests that a useful bytecode interpreter could be very small indeed; the MuP21 is a 6000-transistor chip in which the CPU (there's also some other stuff on the chip) executes a stream of 5-bit zero-operand two-stack operations packed into 20-bit words. (I forget how LITERAL, aka load-immediate, is implemented.) The code for the machine is claimed to be quite compact. Unfortunately, the follow-on chips (the F21, the iTV i21, and the 25x) have not been successfully fabricated.

* Interactive Compilation is Faster

This advantage is probably irrelevant now, except for very small machines, say a 70MHz LPC2103 microcontroller. The SBCL compiler compiled the unoptimized Lisp version in 12ms on a 1.8GHz dual-core CPU; a cleverer implementation, like Apple ][ Integer Basic, could parse and compile as you typed, keeping pauses to a minimum.

* Architecture-Neutral Distribution Format

That is to say, you don't lose your software when you change CPU architectures. Source code is also an architecture-neutral distribution format; I am going to ignore the political reasons for not distributing source code and focus on the technical reason, which is that bytecode is smaller. In effect, the claimed advantage is that bytecode is a good compression algorithm for source code. If this is the case, we should perhaps compare it against generic compression algorithms like gzip. Here are some results from gzip -9:

    file                  size  gzipped  ratio  language
    bicicleta_lexer.mll   2380      965    2.5  ocamllex
    readbmp.py            5022     1910    2.6  Python
    js-calc.js           15520     5767    2.7  JavaScript
    webrick/cgi.rb        6779     2251    3.0  Ruby
    cursmail.py          11627     3938    3.0  Python
    bicicleta.ml         13426     3461    3.9  OCaml

I think these are fairly typical compression ratios for gzip on source code.
The Perl source code version of the fib function is 60 bytes; OCaml is 61, Ruby is 62, unoptimized Lisp is 63, Python is 66, Squeak is 75 if you put it all on one line, Java is 87 if you put it all on one line, and Tcl is 108. So if 2.7 is a typical compression ratio and 63 bytes is a typical size for implementations of this function, we'd expect gzip to produce a result of about 23.3 bytes; if you apply the 2.7 ratio to the Smalltalk function, you get 27.8 bytes. These compare favorably with the Squeak bytecode version at 26 bytes.

For larger functions, bytecode seems to work better as compression. Here's SkipList>>add:ifPresent:, which was the first largish method I found while looking around Squeak at random:

    add: element ifPresent: aBlock
        | node lvl s |
        node := self search: element updating: splice.
        node ifNotNil: [aBlock ifNotNil: [^ aBlock value: node]].
        lvl := self randomLevel.
        node := SkipListNode on: element level: lvl.
        level + 1 to: lvl do: [:i | splice at: i put: self].
        1 to: lvl do: [:i |
            s := splice at: i.
            node atForward: i put: (s forward: i).
            s atForward: i put: node].
        numElements := numElements + 1.
        splice atAllPut: nil.
        ^ element

That's 619 bytes of source if tabified, or 465 bytes if untabified. The bytecode representation is 118 bytes, including 32 bytes of non-bytecode (literal and constant tables, I guess?) and 86 bytes of bytecode. That's a compression ratio of 3.9 to 5.2, which is substantially better than gzip would do. (Maybe I should try 7-zip, but it's not installed on this Mac. bzip2 does worse than gzip on such small files.)

The names of the method selectors and classes referenced herein are not included in those 32 bytes --- rather, there are seven four-byte pointers: #search:updating:, #randomLevel, #on:level:, SkipListNode, #atForward:put:, #forward:, and #atAllPut:. (The other method selectors, like #+, #at:, #value:, and #at:put:, have special bytecodes for them.) The names themselves --- 69 bytes of them --- have to be included somewhere in a bytecode file intended for use in an architecture-neutral distribution format that supports dynamic linking, but they only have to be included once; and if you don't need to link dynamically or modify the software after delivery, they don't need to be included at all.

So it looks like bytecode is a somewhat more compact architecture-neutral distribution format than gzipped source code; the latter has some additional technical advantages, such as being easier to modify, being easier to optimize, containing comments, and not requiring a special-purpose compressor. However, this advantage is less compelling than the code-size advantage: while Squeak needed a factor of 3 to 10 less code space than native-code compilation systems, bytecode uses more like a factor of 1 or 2 less space than gzipped source code --- in a couple of trivial examples, which might not be completely realistic.

* Porting to new platforms is easier

I'm not sure this is true, but it seems plausible: a native-code generator is more tightly coupled to the CPU architecture than a bytecode interpreter, because if the bytecode interpreter is written in a language available on the new target platform, you may be able to port it with a simple recompile.
This may seem irrelevant today, when all "personal computers" and most servers use the Intel 80386 instruction set, but there is a lot of uncertainty on the horizon:

- ARM is still very popular on mobile phones, which are computers, are personal, and outnumber "PCs" perhaps ten to one;

- AVR and PIC are very popular in the deep-embedded space;

- the smoothness of Apple's recent move to Intel processors demonstrates that the barrier to entry for new CPU architectures is at an all-time low;

- most networked appliances (from companies like Cisco, D-Link, Axis, and Broadcom) use non-386 CPUs that run Linux, and there are a huge number of them;

- the Cell chip in the Playstation 3 has two different kinds of non-386 CPUs, and like the above, game consoles outnumber "PCs", but unlike the above, have enormous computational power;

- after the coming transition to highly concurrent software takes place, the metric by which CPUs are valued may change from serial processing throughput to aggregate processing throughput, which may create an opportunity for disruptive CPU innovation.

But, for the moment, portability matters much less than it used to.

* Other advantages?

It could be argued that bytecode compiler/interpreters are simpler than dynamic native-code compilers. I don't have enough experience to know whether this is true in general, but I'm pretty sure that if you're running on top of an environment like SBCL, it isn't. Even spawning an instance of gcc and dlopening the resulting object is likely to run at an acceptable speed, and then you only have to generate C code. (You have to be careful not to #include any large .h files, though.) Parse-tree walkers, such as the Perl interpreter and the kind of Lisp interpreter found in books like SICP and EOPL, are simpler than either one, but they tend to be slower too.

* Disadvantages

Bytecode interpreters are usually pretty slow; the bytecode interpreter used by Squeak probably accounts for most of the difference between it (295ns per base-case recursion) and unoptimized SBCL (52ns). Bytecode systems with JIT compilers, such as recent Java virtual machines, can be less slow, but they clearly lose the code-size advantage normally enjoyed by bytecode interpreters, the postulated simplicity advantage, and probably the compilation-time advantage.

* Conclusion

Bytecode interpreters can offer unsurpassed code compactness, although Python's and OCaml's bytecode formats completely fail to do so. This matters a lot on microcontrollers, but I don't know whether it matters a lot or a little on modern personal-computer CPUs --- this Mac has 2GiB of RAM, but only 2MiB of L2 cache, and presumably something like 16-64KiB of L1 cache. Thrashing the cache is soundly punished. They also offer a size advantage over gzipped source as a software distribution format, but this advantage is less compelling. This suggests that they ought to be popular in the demoscene, where there's apparently a lot of action in the "coolest audiovisual demo under N bytes" categories these days, for values of N such as 128, 256, 4096, and 65536; if they aren't, it casts doubt on this proposed advantage.

I know of no other reason to implement an interpreter using bytecode, so I'm surprised it's such a popular thing to do! I think the reason is probably that code space and compilation time used to be quite precious resources (not to mention portability), and programmers just haven't adjusted to the new realities.