# Simple token-threaded "Forth" interpreter. -*- asm -*-
# by Kragen Javier Sitaker; dedicated to the public domain,
# i.e. I relinquish whatever exclusive rights copyright law
# gives me with regard to this work.
# Major parts taken from Richard W.M. Jones's public-domain
# JONESFORTH 42 by Richard W.M. Jones <[EMAIL PROTECTED]>
# http://annexia.org/forth

# This program just outputs "hello, world, hello" under Linux.

# to compile:   
# gcc -m32 -nostdlib -static -o tokthr2 tokthr2.S


### Why Small Things are Interesting

# There are still a lot of computers out there that have tens of
# kilobytes of memory or less, and they cost a lot less than,
# say, a cellphone.  Cellphones are apparently still too
# expensive for half the world's population.  I want to see how
# close I can get to having a comfortable programming
# environment in a smaller device.

# Some smallish microcontroller chips from five different
# manufacturers, with current Digi-Key prices:
# Name              bytes RAM  bytes ROM  MHz  price    
# ATtiny2313        128        2048       20   US$1.38  
# ATMega48-20AU     512        4096       20   US$1.62  
# MSP430F1111AIPW   128        2264       8    US$2.43  
# LPC2101           2048       8192       70   US$2.52  
# H8/300H Tiny      1536       8192       12   US$3.58  
# M16C/R8C/Tiny/1B  1024       16384      12   US$3.54  
# SX28AC/SS         136        3072       50   US$2.79  

# More practically and short-termly, small projects can take
# less time to finish, and I feel like I need to learn about
# different approaches to implementing programming languages.

### Why This is Small

# The normal Forth representation of a function is as an array
# of pointers to the other functions it calls, in sequence; a
# few of those other functions may move the interpreter pointer
# around in that array, or snarf up a constant that's stored in
# the array, or stuff like that, but for the most part, the
# functions just get called in sequence.  This is called
# "threaded code" and it's fairly compact, especially on 16-bit
# systems where the pointers are only two bytes.

# A traditional approach taken by Forth implementations to
# reduce code size even further is called "token threading".
# Rather than making arrays of 16-bit or 32-bit pointers, they
# make lists of 8-bit indices into an array of pointers.  This
# has two advantages:

# 1. the indices are one fourth the size of a list of 32-bit
#    pointers;
# 2. it is possible to save these lists of indices somewhere
#    outside of memory and continue to use them even after
#    making some changes to the code, as long as the same
#    indices in the table have the same meanings.  So, for
#    example, you could write some boot firmware in this
#    "bytecode".

# It also has some disadvantages:
# 1. You run out of space in the table.  Even a fairly minimal
#    full Forth system contains close to 256 subroutines.  You
#    can mitigate this by packing, say, two 12-bit pointers
#    every three bytes, or maybe by having a special bytecode
#    that looks up the next byte in an extended table.
# 2. It's slower and makes the machine-code part of the program
#    take more space.  The traditional LODSW; JMP AX version of
#    $NEXT from the eForth Model, which fetches and executes the
#    next execution token in the threaded list, is three bytes
#    and two instructions; my 'next' here is 41 bytes and 14
#    instructions, which is big enough that I jump to it (2
#    bytes) rather than making an assembler macro.  Which blows
#    your branch target buffers to pieces.  Oh well.  The
#    performance penalty is probably two orders of magnitude
#    over native code, but I haven't measured it yet.  I
#    measured an earlier version on my 700MHz PIII laptop on an
#    empty loop at about a factor of 3.5 over simple
#    direct-threading, which in turn is on the order of 10 times
#    slower than machine code.

# Anyway, so this is an example program built using this
# technique.  It implements two Forthlike stacks and interpreted
# subroutines, but not yet the ability to define new subroutines
# at run-time.

### What's Here

# I've implemented all of the primitives from C. H. Ting and
# Bill Muench's public-domain (?) eForth Model 1.0, except for
# the following:
# - I haven't implemented their lowercase "next" (as in FOR
#   NEXT) because I think it's a bad idea, it's complex, and it
#   can be implemented at a higher level if you really need it;
# - I didn't implement !IO because it's a no-op in this context;
# - I haven't yet implemented ?RX, although I think it's
#   possible to implement it on top of syscall5, using select();

# However, most of it is untested and therefore probably broken.
# Procedure call and return and the system calls do work.

# Currently registers are used as follows:
# %esi --- interpreter pointer; points to next byte to execute
# %ebp --- return stack pointer; points to last thing pushed.  This stack, 
#          like the other one, grows downwards.
# %esp --- data stack pointer; points to last thing pushed.  This is
#          the processor's standard stack pointer; "push" and "pop"
#          instructions use it, which makes assembly code to use it
#          quite concise.  The Intel "call" and "ret" instructions
#          would also use this stack, but they aren't used in this
#          program.
# flags --- the "down" direction flag must be cleared.
#           Fortunately this seems to be the case by default.

# It's probably missing a couple of primitives needed because of
# the token-threading implementation strategy; the address of
# the token table probably needs to be knowable, at least.

# Direct and indirect threading, the normal Forth approaches to
# allowing unrestricted coexistence of words written in assembly
# language and interpreted Forth, both had heavy space costs
# here --- close to 100% for the bytecode currently in the
# system.  So the inner interpreter checks, for each bytecode,
# whether it is in the range of bytecodes whose interpretations
# are in native code, and picks the relevant code path.  This
# avoids consuming any space per-word for this distinction, but with
# what I assume is a heavy performance cost.

### How Small This Is

# eForth 1.0's machine-code part seems to be 171 instructions
# and 399 bytes, including some data that's mixed in there with
# it.

# Last I checked, this program uses only 19 different
# instructions: jmp, jz; push, pop, lodsb, lodsl, xchg, mov;
# movsbl, inc, and, xor, or, lea, cdq, rcl, add, idiv; and int.

# At the moment, this program is 229 bytes in 125 machine-code
# instructions, plus 13 bytes of read-only data, 104 bytes of
# token table for the 52 currently-defined words, and another
# 143 bytes of bytecode in those words, and then another 22
# bytes in the main program, for a total of 511 bytes.

# One important thing that's missing here is the dictionary
# structure, which will minimally use up another hundred bytes
# or so.

### Other Things I Tried

# I tried switching to caching the top of the data stack in a
# register, on the theory that it would shorten things like
# 'and'.  Currently 'and' is pop %eax; pop %ebx; and %ebx,
# %eax; push %eax; jmp next.  If top of stack is cached in %eax
# instead of being stored in memory, this becomes pop %ebx; and
# %ebx, %eax; jmp next, which is considerably shorter.
# However, most things don't change, and other things become
# longer due to the extra work to save top-of-stack.  I tried
# using both %ebx and %eax as the top-of-stack cache.

# In the version using %ebx as top-of-stack, the total size of the
# machine-code part was 216 bytes, 115 instructions, compared to 197
# bytes, 112 instructions for the version using the current strategy.
# In the version using %eax as top-of-stack, it was only 215 bytes,
# but that's still worse than 197.

# In previous versions, all routines were machine-code routines
# that you could just jmp to.  High-level bytecode words began
# with "call dolist", which took the saved %eip off the stack
# and stuck it in %esi.  Unfortunately, that added 5 bytes to
# each bytecode word; as I write this, the bytecode region is
# 143 bytes and contains 24 word definitions --- so 5 bytes each
# would have been 120 bytes of overhead, or 84%!  It also would
# have required a region to be both executable and writable to
# support run-time routine definition, which is kind of a pain
# thse days.

# In previous versions, the token-table entries were 32 bits
# each (instead of 16 bits as they are now), which added another
# 2 bytes of overhead per word.  There are currently 52 words,
# so that's another 104 bytes of shaved overhead.

### What's Wrong With This Program

# - It's a long way from doing anything useful.
# - Only these words are tested: hello, sub1, type, comma,
#   world, newline, dolit_s8, dot, bye, exit, branch_on_0,
#   c_bang, drop, dup, swap, negative, umplus, divmod,
#   syscall5, syscall3, zero, syscall1, rpop, rpush, one,
#   dolit_32, neg1, add, emit, tuck, udot, udot_nospc,
#   udot_nonzero, branch, invert, add1, negate, xor.  These are
#   not tested, and therefore may be broken: execute, bang, at,
#   c_at, rp_at, rp_bang, sp_at, sp_bang, over, and, or, r_at.
# - There's no dictionary structure yet.
# - It probably needs another couple of primitives.
# - There's no checking for stack overflow or underflow, but
#   they will break things.

### The Beginning of the Program

# .. include system library header so we know __NR_exit = 1 and
# __NR_write = 4
#include <asm/unistd.h>

### The token table

        ## I was frustrated with the unreadability of my
        ## bytecode lists; I was counting token table entries
        ## by hand and writing bytecodes numerically.  So I
        ## wrote a macro to help.

        ## Note that we are using a separate .subsection
        ## directive because gas 2.17 doesn't support putting
        ## that in the .pushsection line, even though it is
        ## documented to do so; see message from Maciej
        ## W. Rozycki on 2007-10-11, subject "Re: How to use
        ## .pushsection?",
        ## http://sourceware.org/ml/binutils/2007-10/msg00176.html
        ## for more details)

        ## The first few entries in the table of bytecodes are
        ## all defined in machine code; the rest are all
        ## defined in bytecode.  The inner interpreter examines
        ## each bytecode to determine which category it falls
        ## in in order to figure out how to execute it,
        ## including what base address to add its offset to.
        ## This sucks for extensibility but rocks for
        ## compactness.

        .macro define_bytecode name, origin
        .pushsection .data      # save current position, go to data section
        .subsection 1           # and subsection 1, where we put the addrs
        b_\name = (. - token_table) / 2 # define b_foo as the index of this ptr
        .ifeq b_\name - 256
        .error "\name got bytecode 256"
        .endif
        #.int \name              # insert pointer which will be resolved next:
        .short \name - \origin # insert offset which will be resolved next
        .popsection             # return to where we were, and
\name:                          # define the name
        .endm
        .macro defasm name
        define_bytecode \name, machine_code_primitives
        .endm
        .macro defbytes name
        define_bytecode \name, bytecode_start
        .endm

        .data 1                 # Start putting stuff in data subsection 1
        .align 4
        ## table to define the "bytecode" instructions
token_table:

        .data 3
instructions:
        # And here is the actual "program" in that bytecode.
        .byte b_hello           # string "hello" and count
        .byte b_sub1            # subtract 1 from count: "hell"
        .byte b_type            # spit it out
        .byte b_comma, b_type, b_world, b_type # ", world"
        .byte b_comma, b_type, b_hello, b_type, b_newline, b_type
        # test the "dot" command to print out numbers
        .byte b_dolit_s8, -120, b_dot
        .byte b_dolit_s8, 104, b_dot, b_newline, b_type
        .byte b_bye

### The Return Stack
# We put Forth return addresses here, but programs can also use
# it for other purposes.

        .bss
        .space 4096
initial_return_stack_pointer:   

### Initialization
                
        .text                   # the following stuff goes in the text segment
        .global _start          # declare _start as a global symbol 
                                # (otherwise ld won't be able to find it)
_start:                         # this is the entry point for ELF I guess
        mov $initial_return_stack_pointer, %ebp
        mov $instructions, %esi # %esi is the interpreter pointer register
        jmp next                # and now we start the interpreter.
                                # (somewhat silly since we could just
                                # fall through..)

### The Machine-Code Primitives
# Also next (aka the address interpreter or inner interpreter)
# is in this section.

machine_code_primitives:
                
# dolit_s8 takes a signed 8-bit literal from the instruction
# stream and pushes it onto the stack.

        defasm dolit_s8
        lodsb
        movsbl %al, %eax
        push %eax
        jmp next

        defasm dolit_32         # more general dolit
        lodsl
        push %eax
        jmp next
        
        defasm exit             # Return from a colon defn.
        xchg %ebp, %esp
        pop %esi
        xchg %ebp, %esp
        jmp next

        defasm execute          # Run xt on data stack.
        pop %eax                # Here 'xt' is the one-byte token.
        jmp execute_eax

# Branch if top of stack is 0 (implementing IF).
# Both branch instructions take a signed byte offset from the bytecode
# stream.
        defasm branch_on_0
        pop %eax
        and %eax, %eax
        jz branch
        inc %esi                # skip 1-byte jump offset
        jmp next

        defasm branch
        lodsb
        movsbl %al, %eax        # same insn size as cbtw; cwde
        add %eax, %esi
        jmp next        

# Store a cell.
        defasm bang
        pop %ebx
        pop (%ebx)              # I'm amazed this is legal
        jmp next
# Fetch a cell.
        defasm at
        pop %ebx
        push (%ebx)             # I'm amazed this is legal too
        jmp next

# Store a byte.
        defasm c_bang
        pop %ebx
        pop %eax
        mov %al, (%ebx)         # push and pop don't do bytes
        jmp next

# Fetch a byte.
        defasm c_at
        pop %ebx
        xor %eax, %eax
        mov (%ebx), %al
        push %eax
        jmp next
        
# Get the return stack pointer.
        defasm rp_at
        push %ebp
        jmp next
        
# Set the return stack pointer.
        defasm rp_bang
        pop %ebp
        jmp next
        
# Pop the return stack to the data stack ( R> )
        defasm rpop
        push (%ebp)
        lea 4(%ebp), %ebp       # add or xchg/pop: same size
        jmp next
        
# Push the return stack from the data stack ( >R )
        defasm rpush
        lea -4(%ebp), %ebp
        pop (%ebp)
        jmp next

# Get the data stack pointer (before it gets pushed).
        defasm sp_at
        push %esp               # safe on 286 and later
        jmp next

# Set the data stack pointer.
        defasm sp_bang
        pop %esp
        jmp next
        
# Pop the stack.
        defasm drop
        pop %eax
        jmp next
        
# Push a copy of TOS.
# eForth 1.0 used BX to index the stack here, for a couple of
# reasons: on the 8086, SP got decremented prior to the fetch,
# and also wasn't valid as a base or index register.
        defasm dup
        push (%esp)
        jmp next
        
# Swap top two stack items ("exch" in PostScript)
        defasm swap
        pop %eax
        pop %ebx
        push %eax
        push %ebx
        jmp next
# Stack manipulation ( w1 w2 -- w1 w2 w1 )
# technically not necessary, but it's so easy and tiny
        defasm over
        push 4(%esp)           
#        jmp next               fall through because "next" is next
        
# "next" fetches the next bytecode and runs it.  It's placed
# here in the middle of the bytecode definitions so that more
# of them can use the short two-byte jump form to get to it.

next:
        xor %eax, %eax          # set %eax to 0
        xor %ebx, %ebx          # clear high half of %ebx
        lodsb                   # load %al from where %esi points
                                # (%esi is the interpreter pointer)
execute_eax:
        ## load offset of new word into %ebx
        mov token_table(,%eax,2), %bx  # bx := token_table[eax * 2bytes]
        cmp $last_asm_bytecode, %eax
        jbe next_primitive      # if primitive, handle primitive word
        ## otherwise, handle a bytecode definition or "colon list"
        # save old %esi on return stack
        xchg %ebp, %esp
        push %esi
        xchg %ebp, %esp
        lea bytecode_start(%ebx), %esi
        jmp next

next_primitive:
        lea machine_code_primitives(%ebx), %ebx
        jmp *%ebx
        

# Push true if n negative. ( n -- f )
        defasm negative
        pop %eax                
        cdq
        push %edx
        jmp next

# Bitwise operators:
        defasm and
        pop %eax
        pop %ebx
        and %ebx, %eax
        push %eax
        jmp next

        defasm or
        pop %eax
        pop %ebx
        or %ebx, %eax
        push %eax
        jmp next

        defasm xor
        pop %eax
        pop %ebx
        xor %ebx, %eax
        push %eax
        jmp next
        
# add two unsigned numbers, returning sum and carry.
# ( u1 u2 -- u3 cy )
        defasm umplus
        xor %ecx, %ecx
        pop %eax
        pop %ebx
        add %ebx, %eax
        rcl $1, %ecx
        push %eax
        push %ecx
        jmp next

# Divide double-precision by single-precision, unsigned.
# UM/MOD from eForth.  ( udl udh un -- ur uq )
        defasm divmod
        pop %ebx
        pop %edx
        pop %eax
        idiv %ebx
        push %edx
        push %eax
        jmp next

# Copy the top of the return stack onto the data stack.
# May be the traditional Forth word "I".
        defasm r_at
        push (%ebp)
        jmp next

# syscall5:   
# Linux system call with up to 5 arguments
# This is no longer the fashionable way to make system calls
# in Linux.  Now you're supposed to use SYSENTER on newer
# CPUs, and rather than have you figure out which one to use,
# the kernel mmaps a chunk of code called a VDSO into your
# memory space at a random address and tells you where to
# find it using the ELF auxiliary vector.  Then you're
# supposed to invoke the dynamic linker or something to parse
# the ELF executable mysteriously manifested in this way by
# the kernel, and then resolve an undefined symbol in libc
# into calls to it.  See "What is linux-gate.so.1?"
# http://www.trilithium.com/johan/2005/08/linux-gate/
# "The Linux kernel: System Calls" by Andries Brouwer, 2003-02-01
# http://www.win.tue.nl/%7Eaeb/linux/lk/lk-4.html
# "About ELF Auxiliary Vectors" by Manu Garg
# http://manugarg.googlepages.com/aboutelfauxiliaryvectors

# But the old int $0x80 approach still works, thank goodness,
# because all of that is *way* more than these ten
# instructions.
        defasm syscall5
        pop %edi
        ## we have to save %esi for the interpreter
        mov %esi, -4(%ebp)
        pop %esi
        pop %edx
        pop %ecx
        pop %ebx
        pop %eax
        int $0x80
        push %eax
        mov -4(%ebp), %esi
        jmp next

        last_asm_bytecode = b_syscall5

### Basic Interpreted Words
        ## a macro for defining interpreted words
        ## Because after I left off b_exit once, I wasted a long
        ## time trying to figure out what was wrong
        .macro def name, bytes:vararg
        defbytes \name
        .byte \bytes
        .byte b_exit
        .endm

        .data 2                 # separate subsection from token table
bytecode_start:
# System call with three arguments.
        def syscall3, b_zero, b_zero, b_syscall5
# System call with one argument.
        def syscall1, b_zero, b_zero, b_syscall3
        def bye, b_dolit_s8,__NR_exit, b_zero, b_syscall1 # exit program
        def zero, b_dolit_s8,0            # push 0
        def one, b_dolit_s8,1
        
# This word outputs a string whose address and count are on 
# the stack.  ( b u -- )

        defbytes type
        .byte b_rpush, b_rpush  # move two args onto rstack
                                # system call is __NR_write:    
        .byte b_dolit_s8,__NR_write
        .byte b_one             # push constant 1: stdout
        .byte b_rpop, b_rpop    # move two args back from rstack
        .byte b_syscall3        # call syscall with 3 args
        .byte b_drop            # discard return value
        .byte b_exit            # return

# The next few words exist just to poke string addresses
# and lengths onto the stack so "type" can print them.
        .macro make_counted_string name, contents
        defbytes \name
        .byte b_dolit_32        # dolit_32 pushes a 32-bit
        .int string_\name       # literal --- an addr, here
                                # now push literal length and return
        .byte b_dolit_s8,string_length_\name, b_exit
        .pushsection .rodata    # define the actual string:
string_\name:
        .ascii "\contents"
        ## Here we count the length of the string --- computers
        ## are for counting bytes so people don't have to!
        string_length_\name = . - string_\name
        .popsection
        .endm

        make_counted_string hello, "hello"
        make_counted_string world, "world"
        make_counted_string comma, ", "
        make_counted_string newline, "\n"

### Some More Basic Words

        def neg1, b_dolit_s8, -1   # ( -- -1 )
        def add, b_umplus, b_drop  # ( a b -- a+b ) drop the carry
        def sub1, b_neg1, b_add    # ( n -- n-1 )
        def rot, b_rpush, b_swap, b_rpop, b_swap # ( a b c -- b c a )
        def unrot, b_rot, b_rot    # ( a b c -- c a b )
        def tuck b_dup, b_unrot    # ( a b -- b a b )

# emit: output a single byte.  eForth calls this "TX!".

# This version is 11 bytes, including the buffer byte, plus the 2-byte
# token table pointer. a machine-code version I wrote the other day
# was 28 bytes.  However, I also added rot, unrot, and tuck to support
# this function, and they total 11 bytes, plus 6 bytes of overhead.
# For a total of 11+2+11+6 = 30 bytes.  Not winning yet on size over
# x86 asm!  But we're getting close.

emit_buffer:
        .byte 0
        defbytes emit
        .byte b_dolit_32
        .int emit_buffer
        .byte b_tuck            # save a copy of address for b_type
        .byte b_c_bang          # store into emit buffer
        .byte b_one, b_type, b_exit # output one-byte buffer

### "u." prints out an unsigned number.
# I had a version of this in x86 machine code in 52 bytes (23
# instructions), essentially exactly the same code as here.
# This is 31 bytes, plus 6 bytes of overhead, plus I had 
# to define b_divmod (9 bytes plus 2 bytes overhead).  Now we are
# starting to win!

        defbytes udot            # print space after number
        .byte b_udot_nospc, b_dolit_s8, 0x20, b_emit, b_exit 
        defbytes udot_nospc      # print number without space
        .byte b_dup, b_branch_on_0,2, b_udot_nonzero, b_exit
        .byte b_drop, b_dolit_s8, '0, b_emit, b_exit
        defbytes udot_nonzero
        .byte b_zero, b_dolit_s8,10, b_divmod # divide by 10
        .byte b_dup, b_branch_on_0,3, b_udot_nonzero # recurse if nonzero
        .byte b_branch,1        # else
        .byte b_drop            # drop zero quotient
        .byte b_dolit_s8, '0, b_add, b_emit # print digit
        .byte b_exit
        
### Add signed numeric output, ".".  This cost 20 bytes plus 8 bytes
# of overhead, but added some fundamental numeric operations; only 12
# of those 28 bytes are specific to "."

        # logical not: return true for 0, false (0) otherwise
        #def zeq, b_branch_on_0,2, b_zero, b_exit, b_neg1
        # logical bitwise not
        def invert, b_dolit_s8, -1, b_xor
        def add1, b_one, b_add
        # arithmetic negation
        def negate, b_invert, b_add1
        # print signed number
        defbytes dot
        .byte b_dup, b_negative, b_branch_on_0,4
        .byte b_dolit_s8, '-, b_emit, b_negate # in the negative case
        .byte b_udot, b_exit

### Obviously the next thing to do is to add ".S", print the
# stack, so that I can stop having to investigate problems by
# using gdb.

Reply via email to