# Simple token-threaded "Forth" interpreter. -*- asm -*- # by Kragen Javier Sitaker; dedicated to the public domain, # i.e. I relinquish whatever exclusive rights copyright law # gives me with regard to this work. # Major parts taken from Richard W.M. Jones's public-domain # JONESFORTH 42 by Richard W.M. Jones <[EMAIL PROTECTED]> # http://annexia.org/forth
# This program just outputs "hello, world, hello" under Linux. # to compile: # gcc -m32 -nostdlib -static -o tokthr2 tokthr2.S ### Why Small Things are Interesting # There are still a lot of computers out there that have tens of # kilobytes of memory or less, and they cost a lot less than, # say, a cellphone. Cellphones are apparently still too # expensive for half the world's population. I want to see how # close I can get to having a comfortable programming # environment in a smaller device. # Some smallish microcontroller chips from five different # manufacturers, with current Digi-Key prices: # Name bytes RAM bytes ROM MHz price # ATtiny2313 128 2048 20 US$1.38 # ATMega48-20AU 512 4096 20 US$1.62 # MSP430F1111AIPW 128 2264 8 US$2.43 # LPC2101 2048 8192 70 US$2.52 # H8/300H Tiny 1536 8192 12 US$3.58 # M16C/R8C/Tiny/1B 1024 16384 12 US$3.54 # SX28AC/SS 136 3072 50 US$2.79 # More practically and short-termly, small projects can take # less time to finish, and I feel like I need to learn about # different approaches to implementing programming languages. ### Why This is Small # The normal Forth representation of a function is as an array # of pointers to the other functions it calls, in sequence; a # few of those other functions may move the interpreter pointer # around in that array, or snarf up a constant that's stored in # the array, or stuff like that, but for the most part, the # functions just get called in sequence. This is called # "threaded code" and it's fairly compact, especially on 16-bit # systems where the pointers are only two bytes. # A traditional approach taken by Forth implementations to # reduce code size even further is called "token threading". # Rather than making arrays of 16-bit or 32-bit pointers, they # make lists of 8-bit indices into an array of pointers. This # has two advantages: # 1. the indices are one fourth the size of a list of 32-bit # pointers; # 2. it is possible to save these lists of indices somewhere # outside of memory and continue to use them even after # making some changes to the code, as long as the same # indices in the table have the same meanings. So, for # example, you could write some boot firmware in this # "bytecode". # It also has some disadvantages: # 1. You run out of space in the table. Even a fairly minimal # full Forth system contains close to 256 subroutines. You # can mitigate this by packing, say, two 12-bit pointers # every three bytes, or maybe by having a special bytecode # that looks up the next byte in an extended table. # 2. It's slower and makes the machine-code part of the program # take more space. The traditional LODSW; JMP AX version of # $NEXT from the eForth Model, which fetches and executes the # next execution token in the threaded list, is three bytes # and two instructions; my 'next' here is 41 bytes and 14 # instructions, which is big enough that I jump to it (2 # bytes) rather than making an assembler macro. Which blows # your branch target buffers to pieces. Oh well. The # performance penalty is probably two orders of magnitude # over native code, but I haven't measured it yet. I # measured an earlier version on my 700MHz PIII laptop on an # empty loop at about a factor of 3.5 over simple # direct-threading, which in turn is on the order of 10 times # slower than machine code. # Anyway, so this is an example program built using this # technique. It implements two Forthlike stacks and interpreted # subroutines, but not yet the ability to define new subroutines # at run-time. ### What's Here # I've implemented all of the primitives from C. H. Ting and # Bill Muench's public-domain (?) eForth Model 1.0, except for # the following: # - I haven't implemented their lowercase "next" (as in FOR # NEXT) because I think it's a bad idea, it's complex, and it # can be implemented at a higher level if you really need it; # - I didn't implement !IO because it's a no-op in this context; # - I haven't yet implemented ?RX, although I think it's # possible to implement it on top of syscall5, using select(); # However, most of it is untested and therefore probably broken. # Procedure call and return and the system calls do work. # Currently registers are used as follows: # %esi --- interpreter pointer; points to next byte to execute # %ebp --- return stack pointer; points to last thing pushed. This stack, # like the other one, grows downwards. # %esp --- data stack pointer; points to last thing pushed. This is # the processor's standard stack pointer; "push" and "pop" # instructions use it, which makes assembly code to use it # quite concise. The Intel "call" and "ret" instructions # would also use this stack, but they aren't used in this # program. # flags --- the "down" direction flag must be cleared. # Fortunately this seems to be the case by default. # It's probably missing a couple of primitives needed because of # the token-threading implementation strategy; the address of # the token table probably needs to be knowable, at least. # Direct and indirect threading, the normal Forth approaches to # allowing unrestricted coexistence of words written in assembly # language and interpreted Forth, both had heavy space costs # here --- close to 100% for the bytecode currently in the # system. So the inner interpreter checks, for each bytecode, # whether it is in the range of bytecodes whose interpretations # are in native code, and picks the relevant code path. This # avoids consuming any space per-word for this distinction, but with # what I assume is a heavy performance cost. ### How Small This Is # eForth 1.0's machine-code part seems to be 171 instructions # and 399 bytes, including some data that's mixed in there with # it. # Last I checked, this program uses only 19 different # instructions: jmp, jz; push, pop, lodsb, lodsl, xchg, mov; # movsbl, inc, and, xor, or, lea, cdq, rcl, add, idiv; and int. # At the moment, this program is 229 bytes in 125 machine-code # instructions, plus 13 bytes of read-only data, 104 bytes of # token table for the 52 currently-defined words, and another # 143 bytes of bytecode in those words, and then another 22 # bytes in the main program, for a total of 511 bytes. # One important thing that's missing here is the dictionary # structure, which will minimally use up another hundred bytes # or so. ### Other Things I Tried # I tried switching to caching the top of the data stack in a # register, on the theory that it would shorten things like # 'and'. Currently 'and' is pop %eax; pop %ebx; and %ebx, # %eax; push %eax; jmp next. If top of stack is cached in %eax # instead of being stored in memory, this becomes pop %ebx; and # %ebx, %eax; jmp next, which is considerably shorter. # However, most things don't change, and other things become # longer due to the extra work to save top-of-stack. I tried # using both %ebx and %eax as the top-of-stack cache. # In the version using %ebx as top-of-stack, the total size of the # machine-code part was 216 bytes, 115 instructions, compared to 197 # bytes, 112 instructions for the version using the current strategy. # In the version using %eax as top-of-stack, it was only 215 bytes, # but that's still worse than 197. # In previous versions, all routines were machine-code routines # that you could just jmp to. High-level bytecode words began # with "call dolist", which took the saved %eip off the stack # and stuck it in %esi. Unfortunately, that added 5 bytes to # each bytecode word; as I write this, the bytecode region is # 143 bytes and contains 24 word definitions --- so 5 bytes each # would have been 120 bytes of overhead, or 84%! It also would # have required a region to be both executable and writable to # support run-time routine definition, which is kind of a pain # thse days. # In previous versions, the token-table entries were 32 bits # each (instead of 16 bits as they are now), which added another # 2 bytes of overhead per word. There are currently 52 words, # so that's another 104 bytes of shaved overhead. ### What's Wrong With This Program # - It's a long way from doing anything useful. # - Only these words are tested: hello, sub1, type, comma, # world, newline, dolit_s8, dot, bye, exit, branch_on_0, # c_bang, drop, dup, swap, negative, umplus, divmod, # syscall5, syscall3, zero, syscall1, rpop, rpush, one, # dolit_32, neg1, add, emit, tuck, udot, udot_nospc, # udot_nonzero, branch, invert, add1, negate, xor. These are # not tested, and therefore may be broken: execute, bang, at, # c_at, rp_at, rp_bang, sp_at, sp_bang, over, and, or, r_at. # - There's no dictionary structure yet. # - It probably needs another couple of primitives. # - There's no checking for stack overflow or underflow, but # they will break things. ### The Beginning of the Program # .. include system library header so we know __NR_exit = 1 and # __NR_write = 4 #include <asm/unistd.h> ### The token table ## I was frustrated with the unreadability of my ## bytecode lists; I was counting token table entries ## by hand and writing bytecodes numerically. So I ## wrote a macro to help. ## Note that we are using a separate .subsection ## directive because gas 2.17 doesn't support putting ## that in the .pushsection line, even though it is ## documented to do so; see message from Maciej ## W. Rozycki on 2007-10-11, subject "Re: How to use ## .pushsection?", ## http://sourceware.org/ml/binutils/2007-10/msg00176.html ## for more details) ## The first few entries in the table of bytecodes are ## all defined in machine code; the rest are all ## defined in bytecode. The inner interpreter examines ## each bytecode to determine which category it falls ## in in order to figure out how to execute it, ## including what base address to add its offset to. ## This sucks for extensibility but rocks for ## compactness. .macro define_bytecode name, origin .pushsection .data # save current position, go to data section .subsection 1 # and subsection 1, where we put the addrs b_\name = (. - token_table) / 2 # define b_foo as the index of this ptr .ifeq b_\name - 256 .error "\name got bytecode 256" .endif #.int \name # insert pointer which will be resolved next: .short \name - \origin # insert offset which will be resolved next .popsection # return to where we were, and \name: # define the name .endm .macro defasm name define_bytecode \name, machine_code_primitives .endm .macro defbytes name define_bytecode \name, bytecode_start .endm .data 1 # Start putting stuff in data subsection 1 .align 4 ## table to define the "bytecode" instructions token_table: .data 3 instructions: # And here is the actual "program" in that bytecode. .byte b_hello # string "hello" and count .byte b_sub1 # subtract 1 from count: "hell" .byte b_type # spit it out .byte b_comma, b_type, b_world, b_type # ", world" .byte b_comma, b_type, b_hello, b_type, b_newline, b_type # test the "dot" command to print out numbers .byte b_dolit_s8, -120, b_dot .byte b_dolit_s8, 104, b_dot, b_newline, b_type .byte b_bye ### The Return Stack # We put Forth return addresses here, but programs can also use # it for other purposes. .bss .space 4096 initial_return_stack_pointer: ### Initialization .text # the following stuff goes in the text segment .global _start # declare _start as a global symbol # (otherwise ld won't be able to find it) _start: # this is the entry point for ELF I guess mov $initial_return_stack_pointer, %ebp mov $instructions, %esi # %esi is the interpreter pointer register jmp next # and now we start the interpreter. # (somewhat silly since we could just # fall through..) ### The Machine-Code Primitives # Also next (aka the address interpreter or inner interpreter) # is in this section. machine_code_primitives: # dolit_s8 takes a signed 8-bit literal from the instruction # stream and pushes it onto the stack. defasm dolit_s8 lodsb movsbl %al, %eax push %eax jmp next defasm dolit_32 # more general dolit lodsl push %eax jmp next defasm exit # Return from a colon defn. xchg %ebp, %esp pop %esi xchg %ebp, %esp jmp next defasm execute # Run xt on data stack. pop %eax # Here 'xt' is the one-byte token. jmp execute_eax # Branch if top of stack is 0 (implementing IF). # Both branch instructions take a signed byte offset from the bytecode # stream. defasm branch_on_0 pop %eax and %eax, %eax jz branch inc %esi # skip 1-byte jump offset jmp next defasm branch lodsb movsbl %al, %eax # same insn size as cbtw; cwde add %eax, %esi jmp next # Store a cell. defasm bang pop %ebx pop (%ebx) # I'm amazed this is legal jmp next # Fetch a cell. defasm at pop %ebx push (%ebx) # I'm amazed this is legal too jmp next # Store a byte. defasm c_bang pop %ebx pop %eax mov %al, (%ebx) # push and pop don't do bytes jmp next # Fetch a byte. defasm c_at pop %ebx xor %eax, %eax mov (%ebx), %al push %eax jmp next # Get the return stack pointer. defasm rp_at push %ebp jmp next # Set the return stack pointer. defasm rp_bang pop %ebp jmp next # Pop the return stack to the data stack ( R> ) defasm rpop push (%ebp) lea 4(%ebp), %ebp # add or xchg/pop: same size jmp next # Push the return stack from the data stack ( >R ) defasm rpush lea -4(%ebp), %ebp pop (%ebp) jmp next # Get the data stack pointer (before it gets pushed). defasm sp_at push %esp # safe on 286 and later jmp next # Set the data stack pointer. defasm sp_bang pop %esp jmp next # Pop the stack. defasm drop pop %eax jmp next # Push a copy of TOS. # eForth 1.0 used BX to index the stack here, for a couple of # reasons: on the 8086, SP got decremented prior to the fetch, # and also wasn't valid as a base or index register. defasm dup push (%esp) jmp next # Swap top two stack items ("exch" in PostScript) defasm swap pop %eax pop %ebx push %eax push %ebx jmp next # Stack manipulation ( w1 w2 -- w1 w2 w1 ) # technically not necessary, but it's so easy and tiny defasm over push 4(%esp) # jmp next fall through because "next" is next # "next" fetches the next bytecode and runs it. It's placed # here in the middle of the bytecode definitions so that more # of them can use the short two-byte jump form to get to it. next: xor %eax, %eax # set %eax to 0 xor %ebx, %ebx # clear high half of %ebx lodsb # load %al from where %esi points # (%esi is the interpreter pointer) execute_eax: ## load offset of new word into %ebx mov token_table(,%eax,2), %bx # bx := token_table[eax * 2bytes] cmp $last_asm_bytecode, %eax jbe next_primitive # if primitive, handle primitive word ## otherwise, handle a bytecode definition or "colon list" # save old %esi on return stack xchg %ebp, %esp push %esi xchg %ebp, %esp lea bytecode_start(%ebx), %esi jmp next next_primitive: lea machine_code_primitives(%ebx), %ebx jmp *%ebx # Push true if n negative. ( n -- f ) defasm negative pop %eax cdq push %edx jmp next # Bitwise operators: defasm and pop %eax pop %ebx and %ebx, %eax push %eax jmp next defasm or pop %eax pop %ebx or %ebx, %eax push %eax jmp next defasm xor pop %eax pop %ebx xor %ebx, %eax push %eax jmp next # add two unsigned numbers, returning sum and carry. # ( u1 u2 -- u3 cy ) defasm umplus xor %ecx, %ecx pop %eax pop %ebx add %ebx, %eax rcl $1, %ecx push %eax push %ecx jmp next # Divide double-precision by single-precision, unsigned. # UM/MOD from eForth. ( udl udh un -- ur uq ) defasm divmod pop %ebx pop %edx pop %eax idiv %ebx push %edx push %eax jmp next # Copy the top of the return stack onto the data stack. # May be the traditional Forth word "I". defasm r_at push (%ebp) jmp next # syscall5: # Linux system call with up to 5 arguments # This is no longer the fashionable way to make system calls # in Linux. Now you're supposed to use SYSENTER on newer # CPUs, and rather than have you figure out which one to use, # the kernel mmaps a chunk of code called a VDSO into your # memory space at a random address and tells you where to # find it using the ELF auxiliary vector. Then you're # supposed to invoke the dynamic linker or something to parse # the ELF executable mysteriously manifested in this way by # the kernel, and then resolve an undefined symbol in libc # into calls to it. See "What is linux-gate.so.1?" # http://www.trilithium.com/johan/2005/08/linux-gate/ # "The Linux kernel: System Calls" by Andries Brouwer, 2003-02-01 # http://www.win.tue.nl/%7Eaeb/linux/lk/lk-4.html # "About ELF Auxiliary Vectors" by Manu Garg # http://manugarg.googlepages.com/aboutelfauxiliaryvectors # But the old int $0x80 approach still works, thank goodness, # because all of that is *way* more than these ten # instructions. defasm syscall5 pop %edi ## we have to save %esi for the interpreter mov %esi, -4(%ebp) pop %esi pop %edx pop %ecx pop %ebx pop %eax int $0x80 push %eax mov -4(%ebp), %esi jmp next last_asm_bytecode = b_syscall5 ### Basic Interpreted Words ## a macro for defining interpreted words ## Because after I left off b_exit once, I wasted a long ## time trying to figure out what was wrong .macro def name, bytes:vararg defbytes \name .byte \bytes .byte b_exit .endm .data 2 # separate subsection from token table bytecode_start: # System call with three arguments. def syscall3, b_zero, b_zero, b_syscall5 # System call with one argument. def syscall1, b_zero, b_zero, b_syscall3 def bye, b_dolit_s8,__NR_exit, b_zero, b_syscall1 # exit program def zero, b_dolit_s8,0 # push 0 def one, b_dolit_s8,1 # This word outputs a string whose address and count are on # the stack. ( b u -- ) defbytes type .byte b_rpush, b_rpush # move two args onto rstack # system call is __NR_write: .byte b_dolit_s8,__NR_write .byte b_one # push constant 1: stdout .byte b_rpop, b_rpop # move two args back from rstack .byte b_syscall3 # call syscall with 3 args .byte b_drop # discard return value .byte b_exit # return # The next few words exist just to poke string addresses # and lengths onto the stack so "type" can print them. .macro make_counted_string name, contents defbytes \name .byte b_dolit_32 # dolit_32 pushes a 32-bit .int string_\name # literal --- an addr, here # now push literal length and return .byte b_dolit_s8,string_length_\name, b_exit .pushsection .rodata # define the actual string: string_\name: .ascii "\contents" ## Here we count the length of the string --- computers ## are for counting bytes so people don't have to! string_length_\name = . - string_\name .popsection .endm make_counted_string hello, "hello" make_counted_string world, "world" make_counted_string comma, ", " make_counted_string newline, "\n" ### Some More Basic Words def neg1, b_dolit_s8, -1 # ( -- -1 ) def add, b_umplus, b_drop # ( a b -- a+b ) drop the carry def sub1, b_neg1, b_add # ( n -- n-1 ) def rot, b_rpush, b_swap, b_rpop, b_swap # ( a b c -- b c a ) def unrot, b_rot, b_rot # ( a b c -- c a b ) def tuck b_dup, b_unrot # ( a b -- b a b ) # emit: output a single byte. eForth calls this "TX!". # This version is 11 bytes, including the buffer byte, plus the 2-byte # token table pointer. a machine-code version I wrote the other day # was 28 bytes. However, I also added rot, unrot, and tuck to support # this function, and they total 11 bytes, plus 6 bytes of overhead. # For a total of 11+2+11+6 = 30 bytes. Not winning yet on size over # x86 asm! But we're getting close. emit_buffer: .byte 0 defbytes emit .byte b_dolit_32 .int emit_buffer .byte b_tuck # save a copy of address for b_type .byte b_c_bang # store into emit buffer .byte b_one, b_type, b_exit # output one-byte buffer ### "u." prints out an unsigned number. # I had a version of this in x86 machine code in 52 bytes (23 # instructions), essentially exactly the same code as here. # This is 31 bytes, plus 6 bytes of overhead, plus I had # to define b_divmod (9 bytes plus 2 bytes overhead). Now we are # starting to win! defbytes udot # print space after number .byte b_udot_nospc, b_dolit_s8, 0x20, b_emit, b_exit defbytes udot_nospc # print number without space .byte b_dup, b_branch_on_0,2, b_udot_nonzero, b_exit .byte b_drop, b_dolit_s8, '0, b_emit, b_exit defbytes udot_nonzero .byte b_zero, b_dolit_s8,10, b_divmod # divide by 10 .byte b_dup, b_branch_on_0,3, b_udot_nonzero # recurse if nonzero .byte b_branch,1 # else .byte b_drop # drop zero quotient .byte b_dolit_s8, '0, b_add, b_emit # print digit .byte b_exit ### Add signed numeric output, ".". This cost 20 bytes plus 8 bytes # of overhead, but added some fundamental numeric operations; only 12 # of those 28 bytes are specific to "." # logical not: return true for 0, false (0) otherwise #def zeq, b_branch_on_0,2, b_zero, b_exit, b_neg1 # logical bitwise not def invert, b_dolit_s8, -1, b_xor def add1, b_one, b_add # arithmetic negation def negate, b_invert, b_add1 # print signed number defbytes dot .byte b_dup, b_negative, b_branch_on_0,4 .byte b_dolit_s8, '-, b_emit, b_negate # in the negative case .byte b_udot, b_exit ### Obviously the next thing to do is to add ".S", print the # stack, so that I can stop having to investigate problems by # using gdb.