# Simple direct-token-threaded "Forth" interpreter. -*- asm -*-
# by Kragen Javier Sitaker; dedicated to the public domain,
# i.e. I relinquish whatever exclusive rights copyright law
# gives me with regard to this work.
# Major parts taken from Richard W.M. Jones's public-domain
# JONESFORTH 42 by Richard W.M. Jones <[EMAIL PROTECTED]>
# http://annexia.org/forth

# This program just outputs "hello, world, hello" under Linux.

# to compile:   
# gcc -m32 -nostdlib -static -o tokthr tokthr.S


### Why Small Things are Interesting

# There are still a lot of computers out there that have tens of
# kilobytes of memory or less, and they cost a lot less than,
# say, a cellphone.  Cellphones are apparently still too
# expensive for half the world's population.  I want to see how
# close I can get to having a comfortable programming
# environment in a smaller device.

# Some smallish microcontroller chips from five different
# manufacturers, with current Digi-Key prices:
# Name              bytes RAM  bytes ROM  MHz  price    
# ATtiny2313        128        2048       20   US$1.38  
# ATMega48-20AU     512        4096       20   US$1.62  
# MSP430F1111AIPW   128        2264       8    US$2.43  
# LPC2101           2048       8192       70   US$2.52  
# H8/300H Tiny      1536       8192       12   US$3.58  
# M16C/R8C/Tiny/1B  1024       16384      12   US$3.54  
# SX28AC/SS         136        3072       50   US$2.79  

# More practically and short-termly, small projects can take
# less time to finish, and I feel like I need to learn about
# different approaches to implementing programming languages.

### Why This is Small

# The normal Forth representation of a function is as an array
# of pointers to the other functions it calls, in sequence; a
# few of those other functions may move the interpreter pointer
# around in that array, or snarf up a constant that's stored in
# the array, or stuff like that, but for the most part, the
# functions just get called in sequence.  This is called
# "threaded code" and it's fairly compact, especially on 16-bit
# systems where the pointers are only two bytes.

# A traditional approach taken by Forth implementations to
# reduce code size even further is called "token threading".
# Rather than making arrays of 16-bit or 32-bit pointers, they
# make lists of 8-bit indices into an array of pointers.  This
# has two advantages:

# 1. the indices are one fourth the size of a list of 32-bit
#    pointers;
# 2. it is possible to save these lists of indices somewhere
#    outside of memory and continue to use them even after
#    making some changes to the code, as long as the same
#    indices in the table have the same meanings.  So, for
#    example, you could write some boot firmware in this
#    "bytecode".

# It also has some disadvantages:
# 1. You run out of space in the table.  Even a fairly minimal
#    full Forth system contains close to 256 subroutines.  You
#    can mitigate this by packing, say, two 12-bit pointers
#    every three bytes, or maybe by having a special bytecode
#    that looks up the next byte in an extended table.
# 2. It's slower and makes the machine-code part of the program
#    take more space.  The traditional LODSW; JMP AX version of
#    $NEXT from the eForth Model, which fetches and executes the
#    next execution token in the threaded list, is three bytes
#    and two instructions; my 'next' here is 12 bytes and four
#    instructions, which is big enough that I jump to it (5
#    bytes) rather than making an assembler macro.  Which blows
#    your branch target buffers to pieces.  Oh well.  I measured
#    the speed hit on my 700MHz PIII laptop on an empty loop at
#    about a factor of 3.5 over simple direct-threading, which
#    in turn is on the order of 10 times slower than machine
#    code.

# Anyway, so this is an example program built using this
# technique.  It implements two Forthlike stacks and interpreted
# subroutines, but not yet the ability to define new subroutines
# at run-time.

### What's Here

# I've implemented all of the primitives from C. H. Ting and
# Bill Muench's public-domain (?) eForth Model 1.0, except for
# the following:
# - I haven't implemented their lowercase "next" (as in FOR
#   NEXT) because I think it's a bad idea, it's complex, and it
#   can be implemented at a higher level if you really need it;
# - I didn't implment !IO because it's a no-op in this context;
# - I haven't yet implemented ?RX, although I think it's
#   possible to implement it on top of syscall5, using select();
# - I haven't implemented TX!, although it would be
#   straightforward to implement on top of "type".

# However, most of it is untested and therefore probably broken.
# Procedure call and return and the system calls do work.

# Currently registers are used as follows:
# %esi --- interpreter pointer; points to next byte to execute
# %ebp --- return stack pointer; points to last thing pushed
# %esp --- data stack pointer; points to last thing pushed
# flags --- the "down" direction flag must be cleared.
#           Fortunately this seems to be the case by default.

# I'll probably try the colorForth thing of keeping the top of
# the data stack in a register and see if that makes things
# smaller.  It probably won't make them faster, since most of
# this interpreter's time will be spent in the interpreting
# loop anyway.

# It's probably missing a couple of primitives needed because of
# the token-threading implementation strategy; the address of
# the token table probably needs to be knowable, at least.

### How Small This Is

# The pure machine-code part of the program runs from 0x08048074
# to 0x08048139, which is 197 bytes.  This is fairly small.  I
# haven't been able to compile or disassemble eForth 1.0 to see
# how big its machine-code part is, but it seemed to be 171
# instructions, while this is 112.  So this is fairly small.

# It uses only 19 different instructions: jmp, jz; push, pop,
# lodsb, lodsl, xchg, mov; movsbl, inc, and, xor, or, lea,
# cdq, rcl; int, ret, and call.

# The bytecode part of the program, including the five-byte
# machine-code header on each word, runs from 0x08048139 to
# 0x080481a8, which is another 111 bytes.  At present this just
# includes some system-call glue, a word to exit cleanly, words
# to push literal zero and one, a word to output a string, and a
# few constant strings.

# These two parts add up to 308 bytes.

# Additionally there is some read-only data: 38 pointers to
# words, 13 bytes of "main program", and then some string data:
# "helloworld, \n".  This totals 178 bytes.

# The whole program, therefore, is 486 bytes.  Why "strip" only
# reduces it to 836 bytes, I don't know.  But I don't think that
# problem will carry over to microcontroller contexts.

# One important thing that's missing here is the dictionary
# structure, which will minimally use up another hundred bytes
# or so.

### What's Wrong With This Program

# - It's a long way from doing anything useful.
# - The bytecode is numeric and unreadable, and it confuses
#   disassemblers.
# - Most of the primitives are untested and therefore probably
#   broken, except for dolit_s8, dolit_32, dolist, exit, and
#   syscall5.
# - There's no dictionary structure yet.
# - It probably needs another couple of primitives.
# - There's no checking for stack overflow or underflow, but
#   they will break things.

### The Beginning of the Program

# .. include system library header so we know __NR_exit = 1 and
# __NR_write = 4
#include <asm/unistd.h>

### The token table
        
        .section .rodata        # stuff goes in ELF's read-only data section
        .align 4
token_table:                    # table to define the "bytecode" instructions
        ## These are pointers to machine-code subroutines
        ##   0      1     2      3      4      5
        .int bye, type, hello, world, comma, newline
        ##     6         7          8      9     10
        .int dolit_s8, dolit_32, dolist, exit, execute
        ##       11        12      13   14    15     16    17
        .int branch_on_0, branch, bang, at, c_bang, c_at, rp_at
        ##     18      19    20     21      22      23    24
        .int rp_bang, rpop, rpush, sp_at, sp_bang, drop, dup
        ##    25    26      27     28   29  30     31     32
        .int swap, over, negative, and, or, xor, umplus, r_at
        ##     33        34       35   36     37
        .int syscall5, syscall1, zero, one, syscall3
instructions:
        # And here is the actual "program" in that bytecode.
        .byte 2, 1, 4, 1, 3, 1, 4, 1, 2, 1, 5, 1, 0

### The Return Stack
# We put Forth return addresses here, but programs can also use
# it for other purposes.

        .bss
        .space 4096
initial_return_stack_pointer:   

### Initialization
                
        .text                   # the following stuff goes in the text segment
        .globl _start           # declare _start as a global symbol 
                                # (otherwise ld won't be able to find it)
_start:                         # this is the entry point for ELF I guess
        mov $initial_return_stack_pointer, %ebp
        mov $instructions, %esi # %esi is the interpreter pointer register
        jmp next                # and now we start the interpreter.
                                # (somewhat silly since we could just
                                # fall through..)

### The Machine-Code Primitives
# Also dolist (aka DOCOL) and next (aka the address interpreter
# or inner interpreter) are in this section.

# dolit_s8 takes a signed 8-bit literal from the instruction
# stream and pushes it onto the stack.
dolit_s8:
        lodsb
        movsbl %al, %eax
        push %eax
        jmp next
dolit_32:                       # more general dolit
        lodsl
        push %eax
        jmp next
# Process colon list by popping saved %eip from stack and 
# putting it into %esi, saving old %esi on return stack.
dolist:                         # approach from eForth 1.0
        xchg %ebp, %esp         # 8 bytes, 5 insns.
        push %esi               # The sub $4, %ebp (or lea)
        xchg %ebp, %esp         # mov %esi, (%ebp) approach was 
        pop %esi                # 1 byte longer.
        jmp next
exit:   xchg %ebp, %esp         # Return from a colon defn.
        pop %esi
        xchg %ebp, %esp
        jmp next
execute:                        # Run xt on data stack.
        ret                     # Here 'xt' is the actual addr,
                                # not the one-byte token.

# Branch if top of stack is 0 (implementing IF).
# Both branch instructions take a signed byte offset from the bytecode
# stream.
branch_on_0:
        pop %eax
        and %eax, %eax
        jz branch
        inc %esi                # skip jump offset
        jmp next
branch:
        lodsb
        movsbl %al, %eax        # same insn size as cbtw; cwde
        add %eax, %esi
        jmp next        

# Store a cell.
bang:   pop %ebx
        pop (%ebx)              # I'm amazed this is legal
        jmp next
# Fetch a cell.
at:     pop %ebx
        push (%ebx)             # XXX illegal
        jmp next

# Store a byte.
c_bang: pop %ebx
        pop %eax
        mov %al, (%ebx)         # push and pop don't do bytes
        jmp next
# Fetch a byte.
c_at:   pop %ebx
        xor %eax, %eax
        mov (%ebx), %al
        push %eax
        jmp next
# Get the return stack pointer.
rp_at:  push %ebp
        jmp next
# Set the return stack pointer.
rp_bang:
        pop %ebp
        jmp next
# Pop the return stack to the data stack ( R> )
rpop:   push (%ebp)
        lea 4(%ebp), %ebp       # add or xchg/pop: same size
        jmp next
# Push the return stack from the data stack ( >R )
rpush:  lea -4(%ebp), %ebp
        pop (%ebp)
        jmp next

# Get the data stack pointer (before it gets pushed).
sp_at:  push %esp               # safe on 286 and later
        jmp next
# Set the data stack pointer.
sp_bang:
        pop %esp
        jmp next
# Pop the stack.
drop:   pop %eax
        jmp next
# Push a copy of TOS.
# eForth 1.0 used BX to index the stack here, for a couple of
# reasons: on the 8086, SP got decremented prior to the fetch,
# and also wasn't valid as a base or index register.
dup:    push (%esp)
        jmp next
# Swap top two stack items ("exch" in PostScript)
swap:   pop %eax
        pop %ebx
        push %eax
        push %ebx
        jmp next
# Stack manipulation ( w1 w2 -- w1 w2 w1 )
# technically not necessary, but it's so easy and tiny
over:   push 4(%esp)           
#        jmp next               fall through because "next" is next
        
# "next" fetches the next bytecode and runs it.  It's placed
# here in the middle of the bytecode definitions so that more
# of them can use the short two-byte jump form to get to it.

next:
        xor %eax, %eax          # set %eax to 0
        lodsb                   # load %al from where %esi points
                                # (%esi is the interpreter pointer)
        lea token_table(,%eax,4), %eax  # eax := token_table + eax * 4bytes
        jmp *(%eax)             # now %eax points at an address in the token
                                # table; so we fetch that address from the
                                # token table and jump to it.

# Push true if n negative. ( n -- f )
negative:
        pop %eax                
        cdq
        push %edx
        jmp next
# Bitwise operators:  
and:    pop %eax
        pop %ebx
        and %ebx, %eax
        push %eax
        jmp next
or:     pop %eax
        pop %ebx
        or %ebx, %eax
        push %eax
        jmp next
xor:    pop %eax
        pop %ebx
        xor %ebx, %eax
        push %eax
        jmp next
# add two unsigned numbers, returning sum and carry.
# ( u1 u2 -- u3 cy )
umplus:
        xor %ecx, %ecx
        pop %eax
        pop %ebx
        add %ebx, %eax
        rcl $1, %ecx
        push %eax
        push %ecx
        jmp next

# Copy the top of the return stack onto the data stack.
# May be the traditional Forth word "I".
r_at:   push (%ebp)
        jmp next

# syscall5:   
# Linux system call with up to 5 arguments
# This is no longer the fashionable way to make system calls
# in Linux.  Now you're supposed to use SYSENTER on newer
# CPUs, and rather than have you figure out which one to use,
# the kernel mmaps a chunk of code called a VDSO into your
# memory space at a random address and tells you where to
# find it using the ELF auxiliary vector.  Then you're
# supposed to invoke the dynamic linker or something to parse
# the ELF executable mysteriously manifested in this way by
# the kernel, and then resolve an undefined symbol in libc
# into calls to it.  See "What is linux-gate.so.1?"
# http://www.trilithium.com/johan/2005/08/linux-gate/
# "The Linux kernel: System Calls" by Andries Brouwer, 2003-02-01
# http://www.win.tue.nl/%7Eaeb/linux/lk/lk-4.html
# "About ELF Auxiliary Vectors" by Manu Garg
# http://manugarg.googlepages.com/aboutelfauxiliaryvectors

# But the old int $0x80 approach still works, thank goodness,
# because all of that is *way* more than these ten
# instructions.
syscall5:
        pop %edi
        ## we have to save %esi for the interpreter
        mov %esi, -4(%ebp)
        pop %esi
        pop %edx
        pop %ecx
        pop %ebx
        pop %eax
        int $0x80
        push %eax
        mov -4(%ebp), %esi
        jmp next

### Basic Interpreted Words

# System call with three arguments.
syscall3:
        call dolist
        .byte 35, 35, 33, 9     # push two 0s, syscall5, ret
# System call with one argument.
syscall1:       
        call dolist
        .byte 35, 35, 37, 9     # push two 0s, syscall3, ret
bye:    call dolist
        .byte 6,__NR_exit, 35, 34, 9
zero:   call dolist             # push 0
        .byte 6,0, 9
one:    call dolist             # push 1
        .byte 6,1, 9

        
# This word outputs a string whose address and count are on 
# the stack.  ( b u -- )
# "call dolist" magically treats whatever follows as bytecode.
# This confuses disassemblers.
type:    call dolist
        .byte 20, 20            # move two args onto rstack
        .byte 6,__NR_write      # system call is __NR_write
        .byte 36                # push constant 1: stdout
        .byte 19, 19            # move two args back from rstack
        .byte 37                # call syscall with 3 args
        .byte 23                # discard return value
        .byte 9                 # return

# The last few words exist just to poke string addresses
# and lengths onto the stack so "type" can print them.
# (I probably should have just written a gas macro...)
hello:  call dolist
        .byte 7                 # dolit_32 pushes a 32-bit
        .int hello_string       # literal, an addr here
        .byte 6,5, 9            # push literal 5 and return
        .section .rodata        # define the actual string:
hello_string:
        .ascii "hello"
        .text

world:  call dolist
        .byte 7
        .int world_string
        .byte 6,5, 9    
        .section .rodata
world_string:
        .ascii "world"
        .text

comma:  call dolist
        .byte 7
        .int comma_string
        .byte 6,2, 9            # comma string only 2 in length
        .section .rodata
comma_string:
        .ascii ", "
        .text

newline:        
        call dolist
        .byte 7
        .int newline_string
        .byte 6,1, 9
        .section .rodata
newline_string:
        .ascii "\n"
        .text

Reply via email to