# Simple direct-token-threaded "Forth" interpreter. -*- asm -*-
# by Kragen Javier Sitaker; dedicated to the public domain,
# i.e. I relinquish whatever exclusive rights copyright law
# gives me with regard to this work.
# Major parts taken from Richard W.M. Jones's public-domain
# JONESFORTH 42 by Richard W.M. Jones <[EMAIL PROTECTED]>
# http://annexia.org/forth
# This program just outputs "hello, world, hello" under Linux.
# to compile:
# gcc -m32 -nostdlib -static -o tokthr tokthr.S
### Why Small Things are Interesting
# There are still a lot of computers out there that have tens of
# kilobytes of memory or less, and they cost a lot less than,
# say, a cellphone. Cellphones are apparently still too
# expensive for half the world's population. I want to see how
# close I can get to having a comfortable programming
# environment in a smaller device.
# Some smallish microcontroller chips from five different
# manufacturers, with current Digi-Key prices:
# Name bytes RAM bytes ROM MHz price
# ATtiny2313 128 2048 20 US$1.38
# ATMega48-20AU 512 4096 20 US$1.62
# MSP430F1111AIPW 128 2264 8 US$2.43
# LPC2101 2048 8192 70 US$2.52
# H8/300H Tiny 1536 8192 12 US$3.58
# M16C/R8C/Tiny/1B 1024 16384 12 US$3.54
# SX28AC/SS 136 3072 50 US$2.79
# More practically and short-termly, small projects can take
# less time to finish, and I feel like I need to learn about
# different approaches to implementing programming languages.
### Why This is Small
# The normal Forth representation of a function is as an array
# of pointers to the other functions it calls, in sequence; a
# few of those other functions may move the interpreter pointer
# around in that array, or snarf up a constant that's stored in
# the array, or stuff like that, but for the most part, the
# functions just get called in sequence. This is called
# "threaded code" and it's fairly compact, especially on 16-bit
# systems where the pointers are only two bytes.
# A traditional approach taken by Forth implementations to
# reduce code size even further is called "token threading".
# Rather than making arrays of 16-bit or 32-bit pointers, they
# make lists of 8-bit indices into an array of pointers. This
# has two advantages:
# 1. the indices are one fourth the size of a list of 32-bit
# pointers;
# 2. it is possible to save these lists of indices somewhere
# outside of memory and continue to use them even after
# making some changes to the code, as long as the same
# indices in the table have the same meanings. So, for
# example, you could write some boot firmware in this
# "bytecode".
# It also has some disadvantages:
# 1. You run out of space in the table. Even a fairly minimal
# full Forth system contains close to 256 subroutines. You
# can mitigate this by packing, say, two 12-bit pointers
# every three bytes, or maybe by having a special bytecode
# that looks up the next byte in an extended table.
# 2. It's slower and makes the machine-code part of the program
# take more space. The traditional LODSW; JMP AX version of
# $NEXT from the eForth Model, which fetches and executes the
# next execution token in the threaded list, is three bytes
# and two instructions; my 'next' here is 12 bytes and four
# instructions, which is big enough that I jump to it (5
# bytes) rather than making an assembler macro. Which blows
# your branch target buffers to pieces. Oh well. I measured
# the speed hit on my 700MHz PIII laptop on an empty loop at
# about a factor of 3.5 over simple direct-threading, which
# in turn is on the order of 10 times slower than machine
# code.
# Anyway, so this is an example program built using this
# technique. It implements two Forthlike stacks and interpreted
# subroutines, but not yet the ability to define new subroutines
# at run-time.
### What's Here
# I've implemented all of the primitives from C. H. Ting and
# Bill Muench's public-domain (?) eForth Model 1.0, except for
# the following:
# - I haven't implemented their lowercase "next" (as in FOR
# NEXT) because I think it's a bad idea, it's complex, and it
# can be implemented at a higher level if you really need it;
# - I didn't implment !IO because it's a no-op in this context;
# - I haven't yet implemented ?RX, although I think it's
# possible to implement it on top of syscall5, using select();
# - I haven't implemented TX!, although it would be
# straightforward to implement on top of "type".
# However, most of it is untested and therefore probably broken.
# Procedure call and return and the system calls do work.
# Currently registers are used as follows:
# %esi --- interpreter pointer; points to next byte to execute
# %ebp --- return stack pointer; points to last thing pushed
# %esp --- data stack pointer; points to last thing pushed
# flags --- the "down" direction flag must be cleared.
# Fortunately this seems to be the case by default.
# I'll probably try the colorForth thing of keeping the top of
# the data stack in a register and see if that makes things
# smaller. It probably won't make them faster, since most of
# this interpreter's time will be spent in the interpreting
# loop anyway.
# It's probably missing a couple of primitives needed because of
# the token-threading implementation strategy; the address of
# the token table probably needs to be knowable, at least.
### How Small This Is
# The pure machine-code part of the program runs from 0x08048074
# to 0x08048139, which is 197 bytes. This is fairly small. I
# haven't been able to compile or disassemble eForth 1.0 to see
# how big its machine-code part is, but it seemed to be 171
# instructions, while this is 112. So this is fairly small.
# It uses only 19 different instructions: jmp, jz; push, pop,
# lodsb, lodsl, xchg, mov; movsbl, inc, and, xor, or, lea,
# cdq, rcl; int, ret, and call.
# The bytecode part of the program, including the five-byte
# machine-code header on each word, runs from 0x08048139 to
# 0x080481a8, which is another 111 bytes. At present this just
# includes some system-call glue, a word to exit cleanly, words
# to push literal zero and one, a word to output a string, and a
# few constant strings.
# These two parts add up to 308 bytes.
# Additionally there is some read-only data: 38 pointers to
# words, 13 bytes of "main program", and then some string data:
# "helloworld, \n". This totals 178 bytes.
# The whole program, therefore, is 486 bytes. Why "strip" only
# reduces it to 836 bytes, I don't know. But I don't think that
# problem will carry over to microcontroller contexts.
# One important thing that's missing here is the dictionary
# structure, which will minimally use up another hundred bytes
# or so.
### What's Wrong With This Program
# - It's a long way from doing anything useful.
# - The bytecode is numeric and unreadable, and it confuses
# disassemblers.
# - Most of the primitives are untested and therefore probably
# broken, except for dolit_s8, dolit_32, dolist, exit, and
# syscall5.
# - There's no dictionary structure yet.
# - It probably needs another couple of primitives.
# - There's no checking for stack overflow or underflow, but
# they will break things.
### The Beginning of the Program
# .. include system library header so we know __NR_exit = 1 and
# __NR_write = 4
#include <asm/unistd.h>
### The token table
.section .rodata # stuff goes in ELF's read-only data section
.align 4
token_table: # table to define the "bytecode" instructions
## These are pointers to machine-code subroutines
## 0 1 2 3 4 5
.int bye, type, hello, world, comma, newline
## 6 7 8 9 10
.int dolit_s8, dolit_32, dolist, exit, execute
## 11 12 13 14 15 16 17
.int branch_on_0, branch, bang, at, c_bang, c_at, rp_at
## 18 19 20 21 22 23 24
.int rp_bang, rpop, rpush, sp_at, sp_bang, drop, dup
## 25 26 27 28 29 30 31 32
.int swap, over, negative, and, or, xor, umplus, r_at
## 33 34 35 36 37
.int syscall5, syscall1, zero, one, syscall3
instructions:
# And here is the actual "program" in that bytecode.
.byte 2, 1, 4, 1, 3, 1, 4, 1, 2, 1, 5, 1, 0
### The Return Stack
# We put Forth return addresses here, but programs can also use
# it for other purposes.
.bss
.space 4096
initial_return_stack_pointer:
### Initialization
.text # the following stuff goes in the text segment
.globl _start # declare _start as a global symbol
# (otherwise ld won't be able to find it)
_start: # this is the entry point for ELF I guess
mov $initial_return_stack_pointer, %ebp
mov $instructions, %esi # %esi is the interpreter pointer register
jmp next # and now we start the interpreter.
# (somewhat silly since we could just
# fall through..)
### The Machine-Code Primitives
# Also dolist (aka DOCOL) and next (aka the address interpreter
# or inner interpreter) are in this section.
# dolit_s8 takes a signed 8-bit literal from the instruction
# stream and pushes it onto the stack.
dolit_s8:
lodsb
movsbl %al, %eax
push %eax
jmp next
dolit_32: # more general dolit
lodsl
push %eax
jmp next
# Process colon list by popping saved %eip from stack and
# putting it into %esi, saving old %esi on return stack.
dolist: # approach from eForth 1.0
xchg %ebp, %esp # 8 bytes, 5 insns.
push %esi # The sub $4, %ebp (or lea)
xchg %ebp, %esp # mov %esi, (%ebp) approach was
pop %esi # 1 byte longer.
jmp next
exit: xchg %ebp, %esp # Return from a colon defn.
pop %esi
xchg %ebp, %esp
jmp next
execute: # Run xt on data stack.
ret # Here 'xt' is the actual addr,
# not the one-byte token.
# Branch if top of stack is 0 (implementing IF).
# Both branch instructions take a signed byte offset from the bytecode
# stream.
branch_on_0:
pop %eax
and %eax, %eax
jz branch
inc %esi # skip jump offset
jmp next
branch:
lodsb
movsbl %al, %eax # same insn size as cbtw; cwde
add %eax, %esi
jmp next
# Store a cell.
bang: pop %ebx
pop (%ebx) # I'm amazed this is legal
jmp next
# Fetch a cell.
at: pop %ebx
push (%ebx) # XXX illegal
jmp next
# Store a byte.
c_bang: pop %ebx
pop %eax
mov %al, (%ebx) # push and pop don't do bytes
jmp next
# Fetch a byte.
c_at: pop %ebx
xor %eax, %eax
mov (%ebx), %al
push %eax
jmp next
# Get the return stack pointer.
rp_at: push %ebp
jmp next
# Set the return stack pointer.
rp_bang:
pop %ebp
jmp next
# Pop the return stack to the data stack ( R> )
rpop: push (%ebp)
lea 4(%ebp), %ebp # add or xchg/pop: same size
jmp next
# Push the return stack from the data stack ( >R )
rpush: lea -4(%ebp), %ebp
pop (%ebp)
jmp next
# Get the data stack pointer (before it gets pushed).
sp_at: push %esp # safe on 286 and later
jmp next
# Set the data stack pointer.
sp_bang:
pop %esp
jmp next
# Pop the stack.
drop: pop %eax
jmp next
# Push a copy of TOS.
# eForth 1.0 used BX to index the stack here, for a couple of
# reasons: on the 8086, SP got decremented prior to the fetch,
# and also wasn't valid as a base or index register.
dup: push (%esp)
jmp next
# Swap top two stack items ("exch" in PostScript)
swap: pop %eax
pop %ebx
push %eax
push %ebx
jmp next
# Stack manipulation ( w1 w2 -- w1 w2 w1 )
# technically not necessary, but it's so easy and tiny
over: push 4(%esp)
# jmp next fall through because "next" is next
# "next" fetches the next bytecode and runs it. It's placed
# here in the middle of the bytecode definitions so that more
# of them can use the short two-byte jump form to get to it.
next:
xor %eax, %eax # set %eax to 0
lodsb # load %al from where %esi points
# (%esi is the interpreter pointer)
lea token_table(,%eax,4), %eax # eax := token_table + eax * 4bytes
jmp *(%eax) # now %eax points at an address in the token
# table; so we fetch that address from the
# token table and jump to it.
# Push true if n negative. ( n -- f )
negative:
pop %eax
cdq
push %edx
jmp next
# Bitwise operators:
and: pop %eax
pop %ebx
and %ebx, %eax
push %eax
jmp next
or: pop %eax
pop %ebx
or %ebx, %eax
push %eax
jmp next
xor: pop %eax
pop %ebx
xor %ebx, %eax
push %eax
jmp next
# add two unsigned numbers, returning sum and carry.
# ( u1 u2 -- u3 cy )
umplus:
xor %ecx, %ecx
pop %eax
pop %ebx
add %ebx, %eax
rcl $1, %ecx
push %eax
push %ecx
jmp next
# Copy the top of the return stack onto the data stack.
# May be the traditional Forth word "I".
r_at: push (%ebp)
jmp next
# syscall5:
# Linux system call with up to 5 arguments
# This is no longer the fashionable way to make system calls
# in Linux. Now you're supposed to use SYSENTER on newer
# CPUs, and rather than have you figure out which one to use,
# the kernel mmaps a chunk of code called a VDSO into your
# memory space at a random address and tells you where to
# find it using the ELF auxiliary vector. Then you're
# supposed to invoke the dynamic linker or something to parse
# the ELF executable mysteriously manifested in this way by
# the kernel, and then resolve an undefined symbol in libc
# into calls to it. See "What is linux-gate.so.1?"
# http://www.trilithium.com/johan/2005/08/linux-gate/
# "The Linux kernel: System Calls" by Andries Brouwer, 2003-02-01
# http://www.win.tue.nl/%7Eaeb/linux/lk/lk-4.html
# "About ELF Auxiliary Vectors" by Manu Garg
# http://manugarg.googlepages.com/aboutelfauxiliaryvectors
# But the old int $0x80 approach still works, thank goodness,
# because all of that is *way* more than these ten
# instructions.
syscall5:
pop %edi
## we have to save %esi for the interpreter
mov %esi, -4(%ebp)
pop %esi
pop %edx
pop %ecx
pop %ebx
pop %eax
int $0x80
push %eax
mov -4(%ebp), %esi
jmp next
### Basic Interpreted Words
# System call with three arguments.
syscall3:
call dolist
.byte 35, 35, 33, 9 # push two 0s, syscall5, ret
# System call with one argument.
syscall1:
call dolist
.byte 35, 35, 37, 9 # push two 0s, syscall3, ret
bye: call dolist
.byte 6,__NR_exit, 35, 34, 9
zero: call dolist # push 0
.byte 6,0, 9
one: call dolist # push 1
.byte 6,1, 9
# This word outputs a string whose address and count are on
# the stack. ( b u -- )
# "call dolist" magically treats whatever follows as bytecode.
# This confuses disassemblers.
type: call dolist
.byte 20, 20 # move two args onto rstack
.byte 6,__NR_write # system call is __NR_write
.byte 36 # push constant 1: stdout
.byte 19, 19 # move two args back from rstack
.byte 37 # call syscall with 3 args
.byte 23 # discard return value
.byte 9 # return
# The last few words exist just to poke string addresses
# and lengths onto the stack so "type" can print them.
# (I probably should have just written a gas macro...)
hello: call dolist
.byte 7 # dolit_32 pushes a 32-bit
.int hello_string # literal, an addr here
.byte 6,5, 9 # push literal 5 and return
.section .rodata # define the actual string:
hello_string:
.ascii "hello"
.text
world: call dolist
.byte 7
.int world_string
.byte 6,5, 9
.section .rodata
world_string:
.ascii "world"
.text
comma: call dolist
.byte 7
.int comma_string
.byte 6,2, 9 # comma string only 2 in length
.section .rodata
comma_string:
.ascii ", "
.text
newline:
call dolist
.byte 7
.int newline_string
.byte 6,1, 9
.section .rodata
newline_string:
.ascii "\n"
.text