As of version 3, the machine-code part of this interpreter is
239 bytes in 129 machine-code instructions, plus 19 bytes of
read-only data, 172 bytes of token table for the 86
currently-defined words (30 of which are primitive), 397
bytes of their names, and another 370 bytes of bytecode in
the 56 bytecode words (including a couple of data values
scattered around), and then another 56 bytes in the main
program, for a total of 1253 bytes.

I'm almost to the point of being able to parse "words separated by
spaces".

# Simple token-threaded "Forth" interpreter. -*- asm -*-
# Version 3.
# by Kragen Javier Sitaker; dedicated to the public domain,
# i.e. I relinquish whatever exclusive rights copyright law
# gives me with regard to this work.
# Major parts taken from Richard W.M. Jones's public-domain
# JONESFORTH 42 by Richard W.M. Jones <[EMAIL PROTECTED]>
# http://annexia.org/forth

# As of version 2, this program just outputs the following under
# Linux:
# hell, world, hello
# -120 1 104 
# s: 102 101 100 
# 397 
# lit8 lit ret execute (if) (else) (loop) ! @ c! c@ rp@ rp! r> >r sp@ sp! pop 
dup over swap 0< & | ^ um+ /% um* r@ syscall5 syscall3 syscall1 bye 0 1 type 
hello world comma count cr -1 + 1- rot -rot tuck emit u. (u.) ((u.)) ~ 1+ 
negate . scolon .s pick cells * depth - / cellsize nip (do) i 2dup 2drop dict 
dictp dictsize nextword pastdict? words < >= 0= cbcmp [EMAIL PROTECTED] bcmp 
memcmp unloop find r2@ 2swap 
# hello134517863 
# , 0 
# 134517710 

# to compile:   
# gcc -m32 -nostdlib -static -o tokthr2 tokthr2.S


### Why Small Things are Interesting

# There are still a lot of computers out there that have tens of
# kilobytes of memory or less, and they cost a lot less than,
# say, a cellphone.  Cellphones are apparently still too
# expensive for half the world's population.  I want to see how
# close I can get to having a comfortable programming
# environment in a smaller device.

# Some smallish microcontroller chips from five different
# manufacturers, with current Digi-Key prices:
# Name              bytes RAM  bytes ROM  MHz  price    
# ATtiny2313        128        2048       20   US$1.38  
# ATMega48-20AU     512        4096       20   US$1.62  
# MSP430F1111AIPW   128        2264       8    US$2.43  
# LPC2101           2048       8192       70   US$2.52  
# H8/300H Tiny      1536       8192       12   US$3.58  
# M16C/R8C/Tiny/1B  1024       16384      12   US$3.54  
# SX28AC/SS         136        3072       50   US$2.79  

# There are essentially no 386-compatible devices in this price
# range as far as I can tell; for why I'm not worried about
# that, see the section "How Small This Is".

# More practically and short-termly, small projects can take
# less time to finish, and I feel like I need to learn about
# different approaches to implementing programming languages.

### Why This is Small

# The normal Forth representation of a function is as an array
# of pointers to the other functions it calls, in sequence; a
# few of those other functions may move the interpreter pointer
# around in that array, or snarf up a constant that's stored in
# the array, or stuff like that, but for the most part, the
# functions just get called in sequence.  This is called
# "threaded code" and it's fairly compact, especially on 16-bit
# systems where the pointers are only two bytes.

# A traditional approach taken by Forth implementations to
# reduce code size even further is called "token threading".
# Rather than making arrays of 16-bit or 32-bit pointers, they
# make lists of 8-bit indices into an array of pointers.  This
# has two advantages:

# 1. the indices are one fourth the size of a list of 32-bit
#    pointers;
# 2. it is possible to save these lists of indices somewhere
#    outside of memory and continue to use them even after
#    making some changes to the code, as long as the same
#    indices in the table have the same meanings.  So, for
#    example, you could write some boot firmware in this
#    "bytecode".

# It also has some disadvantages:
# 1. You run out of space in the table.  Even a fairly minimal
#    full Forth system contains close to 256 subroutines.  You
#    can mitigate this by packing, say, two 12-bit pointers
#    every three bytes, or maybe by having a special bytecode
#    that looks up the next byte in an extended table.
# 2. It's slower and makes the machine-code part of the program
#    take more space.  The traditional LODSW; JMP AX version of
#    $NEXT from the eForth Model, which fetches and executes the
#    next execution token in the threaded list, is three bytes
#    and two instructions; my 'next' here is 41 bytes and 14
#    instructions, which is big enough that I jump to it (2
#    bytes) rather than making an assembler macro.  Which blows
#    your branch target buffers to pieces.  Oh well.  The
#    performance penalty is probably two orders of magnitude
#    over native code, but I haven't measured it yet.  I
#    measured an earlier version on my 700MHz PIII laptop on an
#    empty loop at about a factor of 3.5 over simple
#    direct-threading, which in turn is on the order of 10 times
#    slower than machine code.

# Anyway, so this is an example program built using this
# technique.  It implements two Forthlike stacks and interpreted
# subroutines, but not yet the ability to define new subroutines
# at run-time.

### What's Here

# I've implemented all of the primitives from C. H. Ting and
# Bill Muench's public-domain (?) eForth Model 1.0, except for
# the following:
# - I haven't implemented their lowercase "next" (as in FOR
#   NEXT) because I think it's a bad idea, it's complex, and it
#   can be implemented at a higher level if you really need it;
#   instead, I implemented (loop).        
# - I didn't implement !IO because it's a no-op in this context;
# - I haven't yet implemented ?RX, although I think it's
#   possible to implement it on top of syscall5, using select();
# Additionally, I implemented multiply and divide primitives.

# However, some of it is untested and therefore probably broken.
# Procedure call and return and the system calls do work.

# Currently registers are used as follows:
# %esi --- interpreter pointer; points to next byte to execute
# %ebp --- return stack pointer; points to last thing pushed.  This stack, 
#          like the other one, grows downwards.
# %esp --- data stack pointer; points to last thing pushed.  This is
#          the processor's standard stack pointer; "push" and "pop"
#          instructions use it, which makes assembly code to use it
#          quite concise.  The Intel "call" and "ret" instructions
#          would also use this stack, but they aren't used in this
#          program.
# flags --- the "down" direction flag must be cleared.

# It's probably missing a couple of primitives needed because of
# the token-threading implementation strategy; the address of
# the token table probably needs to be knowable, at least.

# Direct and indirect threading, the normal Forth approaches to
# allowing unrestricted coexistence of words written in assembly
# language and interpreted Forth, both had heavy space costs
# here --- close to 100% for the bytecode currently in the
# system.  So the inner interpreter checks, for each bytecode,
# whether it is in the range of bytecodes whose interpretations
# are in native code, and picks the relevant code path.  This
# avoids consuming any space per-word for this distinction, but with
# what appears to be a heavy performance cost.

# This is similar to an approach called "bit threading"; as
# explained by Jeff Fox in a comp.lang.forth message
# 2007-05-13, on thread "Cases where Forth seems a little
# clunky":

#       I have seen hardware and software implemenations of
#       bit-threading where the msb of the address space
#       selects between threaded code address lists and
#       addresses of CODE subroutines. In both cases 0 is a
#       valid address and negative addresses are valid. I think
#       this applied to Novix.

# Except that this approach uses a fraction of a bit for most
# tokens instead of a whole bit.

### How Small This Is

# As a point of comparison, eForth 1.0's machine-code part seems
# to be 171 instructions and 399 bytes, including some data
# that's mixed in there with it.

# As of version 3, the machine-code part of this interpreter is
# 239 bytes in 129 machine-code instructions, plus 19 bytes of
# read-only data, 172 bytes of token table for the 86
# currently-defined words (30 of which are primitive), 397
# bytes of their names, and another 370 bytes of bytecode in
# the 56 bytecode words (including a couple of data values
# scattered around), and then another 56 bytes in the main
# program, for a total of 1253 bytes.  So the program is
# already less than half written in assembly, in terms of
# object-code size.

# As of version 3, the machine-code part of this program uses
# only 25 different instructions: cld; jmp, jz, jnc, jbe; push,
# pop, lodsb, lodsl, xchg, mov; cmp, movsbl, inc, and, xor, or,
# lea, cdq, rcl, add, idiv, imul; and int.  Interestingly, many
# of them are only used once.

# Just before version 3, the non-comment lines were 14%
# assembler macros, 51% assembly language, and 35% bytecode.

# From Brad Rodriguez's January/February 1993 Computer Journal
# article, "Moving Forth: Part 1: Design Decisions in the Forth
# Kernel":
#       You can expect the usual 16-bit Forth kernel (see below)
#       to occupy about 8K bytes of program space. For a full
#       kernel that can compile Forth definitions, you should
#       allow a minimum of 1K byte of RAM.

# I'm pretty sure I can beat the 8K requirement by quite a bit
# and still be able to compile Forth definitions --- I'm hoping
# for a factor of 4.  Consider, from Jeff Fox's "Thoughtful
# Programming", chapter 3: http://www.ultratechnology.com/forth3.htm

#       People assume that since Chuck has refined his Forth down to
#       about a 1K object that this means he has just stripped his
#       Forth down to a 1K kernel that will boot like in the old days
#       and that he is going to compile a complete Forth system on top
#       of the 1K before he starts an application. This is wrong. The
#       complete Forth system is 1K, and the reason for that is
#       maximize Chuck's productivity. What stops people from doing
#       what they need to do to solve a problem is all the time spend
#       solving all the related sub-problems that pop up as a result
#       of complex interconnections between components. To maximize
#       his productivity Chuck minimizes the number of these side
#       problems that pop up. Keep it simple, and don't get to where
#       you are spending 90% or 99% or you time dealing with related
#       sub-problems. Avoid unsolvable problems, don't waste your time
#       trying to solve them.

# Consider also this quote from Elizabeth Rather:

#       As you have seen, so much depends on the specific
#       machine architecture. We implemented a TTC [token
#       threaded] Forth on some low-end AVR processors with
#       very limited code space, and it ran faster than a
#       native-code 8051 at comparable clock speed.

# (2007-06-23 comp.lang.forth post on thread "Build your own
# Forth for Microchip PIC (Episode 838): Threading")
# 
http://objectmix.com/forth/168105-build-your-own-forth-microchip-pic-episode-838-threading.html

# Consider also this quote from the abstract of Frank
# Sergeant's 1991 "A 3-Instruction Forth for Embedded Systems
# Work: Illustrated on the Motorola MC68HC11":

#       A 3- instruction Forth makes Forth affordable for
#       target systems with very limited memory. It can be
#       brought up quickly on strange new hardware. You don't
#       have to do without Forth because of memory or time
#       limitations. It only takes 66 bytes for the Motorola
#       MC68HC11. Full source is provided. . . . The absolute
#       minimum the target must do, it seems to me, is fetch a
#       byte, store a byte, and call a subroutine.

# http://pygmy.utoh.org/3ins4th.html

# As I said before, there are no small, cheap 386s, and so of
# course the code size of this version is an approximation, and
# it won't be a simple recompile to port it to one of these
# other architectures; but 129 instructions' worth of assembly
# are probably not that big a deal to rewrite for a new
# platform.  (I'll probably want to write multiply and divide
# routines, though.)

### Other Things I Tried

# I tried switching to caching the top of the data stack in a
# register, on the theory that it would shorten things like
# 'and'.  Currently 'and' is pop %eax; pop %ebx; and %ebx,
# %eax; push %eax; jmp next.  If top of stack is cached in %eax
# instead of being stored in memory, this becomes pop %ebx; and
# %ebx, %eax; jmp next, which is considerably shorter.
# However, most things don't change, and other things become
# longer due to the extra work to save top-of-stack.  I tried
# using both %ebx and %eax as the top-of-stack cache.

# In the version using %ebx as top-of-stack, the total size of
# the machine-code part was 216 bytes, 115 instructions,
# compared to 197 bytes, 112 instructions for the version using
# the current strategy.  In the version using %eax as
# top-of-stack, it was only 215 bytes, but that's still worse
# than 197.

# In previous versions, all routines were machine-code routines
# that you could just jmp to.  High-level bytecode words began
# with "call dolist", which took the saved %eip off the stack
# and stuck it in %esi.  Unfortunately, that added 5 bytes to
# each bytecode word; in version 2, the bytecode region is 221
# bytes and contains 36 word definitions (the 30 machine-code
# primitives aren't defined there) --- so 5 bytes each would
# have been 180 bytes of overhead, or 82%!  It also would have
# required a region to be both executable and writable to
# support run-time routine definition, which is kind of a pain
# thse days, and also on Harvard-architecture microcontrollers.

# In previous versions, the token-table entries were 32 bits
# each (instead of 16 bits as they are now), which added
# another 2 bytes of overhead per word.  In version 2, there
# are currently 66 words, so that's another 132 bytes of shaved
# overhead.

### Performance (Speed)

# To version 2, I added a simple program to print out a
# badly-formatted 8-bit extended ASCII table; it was 11
# bytecode operations long.  It executed 81615 bytecodes in all
# (according to gdb).  On my 700MHz PIII-Coppermine laptop from
# 1999, 'time' reports CPU times varying from 4-12
# milliseconds.  So it seems like it can execute about 10 000
# bytecodes in a millisecond, or about 10 million bytecodes in
# a second --- about 100ns per bytecode.

# It also made 1459 system calls.

# That's slightly faster than Python's bytecode engine (maybe
# --- probably within measurement error --- and anyway Python's
# bytecodes are larger-grained), and an order of magnitude
# slower than simple direct-threading, and about three times
# slower than direct-threading with a simple bytecode
# indirection layer.

# I would be surprised if version 3 had any measurable change
# in speed from version 2.

# To version 3, I added a little loop to repeatedly search the
# 86-word wordlist for ", ", which isn't there.  I was able to
# do this search 10 000 times in 4.59 CPU seconds, which is
# 5.34 microseconds per comparison, or about 3700 CPU cycles.
# If we eventually have 300 words in the wordlist, each search
# will therefore take 1.6 milliseconds in the worst case, so
# the system won't be able to compile or interpret more than
# about 50-100 lines of code per second.  It should probably be
# pretty easy to fix this particular performance problem
# (recode wordlist search in asm, restructure, or whatever) if
# it comes up, but it's a bit worrisome...

### What's Wrong With This Program

# - It's a long way from doing anything useful.
# - There's 21 instructions of unused code which may be broken.
#   In Version 3, these words are in use and so probably work:
#   hello, sub1, type, comma, world, newline, dolit_s8, dot,
#   bye, exit, branch_on_0, c_bang, drop, dup, swap, negative,
#   umplus, divmod, syscall5, syscall3, zero, syscall1, rpop,
#   rpush, one, dolit_32, neg1, add, emit, tuck, udot,
#   udot_nospc, udot_nonzero, branch, invert, add1, negate,
#   xor, printstack, cells, mul, depth, div, sub, cellsize,
#   nip, pick, twodup, i, _do, r_at, over, at, sp_at, bang,
#   twoswap, r_2at, find, unloop, memcmp, bcmp, c_at_inc,
#   cbcmp, zeq, ge, lt, words, pastdict, nextword, dictsize,
#   dictp, dict, c_at.  These are not tested, and therefore may be
#   broken: execute, rp_at, rp_bang, sp_bang, and, or.
# - There's no dictionary structure yet.
# - It probably needs another couple of primitives.
# - There's no checking for stack overflow or underflow, but
#   they will break things.
# - It's slow; see above about performance.

### The Beginning of the Program

# .. include system library header so we know __NR_exit = 1 and
# __NR_write = 4
#include <asm/unistd.h>

### The token table and dictionary

        # To save space, we're trying to avoid storing pointers
        # as much as possible.  So most of the code is
        # represented as "tokens", which are offsets into the
        # "token table", which contains 16-bit offsets into
        # either the machine-code primitives or the data
        # segment.

        # The "dictionary" is stored in this program as a
        # sequence of strings, stuffed next to each other with
        # no intervening pointers at all.  The idea is that if
        # you want to execute or compile a word, you walk
        # through the dictionary, examining each string in turn
        # to see if it's the right word.  If so, then the
        # number of strings you rejected is the index into the
        # token table.

        # This implies that the dictionary needs to be stored
        # somewhere where it can expand without overwriting
        # other stuff.  So I'm putting it in its own section
        # for the time being.  I'm pretty sure that means it'll
        # get at least a page, which is enough space to define
        # at least several hundred words.  Maybe at some future
        # time I'll copy it into .bss at boot time instead.

        .data 1                 # Start putting stuff in data subsection 1
        .align 4
        ## table to define the "bytecode" instructions
token_table:
        # "a" means "allocatable", "w" means "writable"
        .section .dictionary, "aw"
dictionary:

        ## I was frustrated with the unreadability of my
        ## bytecode lists; I was counting token table entries
        ## by hand and writing bytecodes numerically.  So I
        ## wrote a macro to help.

        ## Note that we are using a separate .subsection
        ## directive because gas 2.17 doesn't support putting
        ## that in the .pushsection line, even though it is
        ## documented to do so; see message from Maciej
        ## W. Rozycki on 2007-10-11, subject "Re: How to use
        ## .pushsection?",
        ## http://sourceware.org/ml/binutils/2007-10/msg00176.html
        ## for more details)

        ## The first few entries in the table of bytecodes are
        ## all defined in machine code; the rest are all
        ## defined in bytecode.  The inner interpreter examines
        ## each bytecode to determine which category it falls
        ## in in order to figure out how to execute it,
        ## including what base address to add its offset to.
        ## This sucks for extensibility but rocks for
        ## compactness.

        .macro countedstring name
        .byte stringlength\@
1:      .ascii "\name"
        ## Here we count the length of the string --- computers
        ## are for counting bytes so people don't have to!
        stringlength\@ = . - 1b
        .endm

        .macro define_bytecode name, realname, origin
        .pushsection .data      # save current position, go to data section
        .subsection 1           # and subsection 1, where we put the addrs
        b_\name = (. - token_table) / 2 # define b_foo as the index of this ptr
        .ifeq b_\name - 256
        .error "\name got bytecode 256"
        .endif
        .short \name - \origin # insert offset which will be resolved next
        .popsection             # return to where we were, and
        .pushsection .dictionary
        countedstring "\realname"
        .popsection
\name:                          # define the name
        .endm
        .macro defasm name, realname
        define_bytecode \name, "\realname", machine_code_primitives
        .endm
        .macro defbytes name, realname
        define_bytecode \name, "\realname", bytecode_start
        .endm

### The Return Stack
# We put Forth return addresses here, but programs can also use
# it for other purposes.

        .bss
        .space 4096
initial_return_stack_pointer:   

### Initialization
                
        .text                   # the following stuff goes in the text segment
        .global _start          # declare _start as a global symbol 
                                # (otherwise ld won't be able to find it)
_start:                         # this is the entry point for ELF I guess
        cld                     # clear direction flag; unnecessary?
        mov $initial_return_stack_pointer, %ebp
        mov $instructions, %esi # %esi is the interpreter pointer register
        jmp next                # and now we start the interpreter.
                                # (somewhat silly since we could just
                                # fall through..)

### The Machine-Code Primitives
# Also next (aka the address interpreter or inner interpreter)
# is in this section.

machine_code_primitives:
                
# dolit_s8 takes a signed 8-bit literal from the instruction
# stream and pushes it onto the stack.

        defasm dolit_s8, "lit8"
        lodsb
        movsbl %al, %eax
        jmp pusheax

        defasm dolit_32, "lit"  # more general dolit
        lodsl
        jmp pusheax
        
        defasm exit, "ret"      # Return from a colon defn.
        xchg %ebp, %esp
        pop %esi
        xchg %ebp, %esp
        jmp next

        defasm execute, "execute" # Run xt on data stack.
        pop %eax                # Here 'xt' is the one-byte token.
        jmp execute_eax

# Branch if top of stack is 0 (implementing IF).
# Both branch instructions take a signed byte offset from the bytecode
# stream.
        defasm branch_on_0, "(if)"
        pop %eax
        and %eax, %eax
        jz branch
        inc %esi                # skip 1-byte jump offset
        jmp next

        defasm branch, "(else)"
        lodsb
        movsbl %al, %eax        # same insn size as cbtw; cwde
        add %eax, %esi
        jmp next        

# (loop) is what we do at the end of a DO LOOP.

# DO puts a number on top of the return stack that is zero minus
# the number of iterations remaining.  So when it finally
# reaches zero, we're done.  It also put a number underneath
# that on the return stack from which you can recover the
# iteration counter.

# This scheme is mostly due to F-83, except that in F-83 it
# reached 0x8000 instead of 0, which seemed perverse to me.

# This isn't strictly necessary as part of the minimal primitive
# set, but it seemed like it would make inner loops maybe ten
# times faster, and as with most words that do return-stack
# manipulation, the size penalty is actually negative.  (This
# version is 14 bytes; in bytecode it would be 17.)

# ( -- ) ( R: -1 adjustment -- ) in end-of-loop case, and skip
# the interpreter pointer over the jump offset.
# ( -- ) ( R: counter -- counter+1 ) in normal case,  and adjust
# interpreter pointer by number of bytes stored after call to
# this routine.

        defasm _loop, "(loop)"
        xor %eax, %eax          # mov $1, %eax is 5 bytes
        inc %eax
        add %eax, (%ebp)
        jnc branch              # if no carry, go branch
        # If there was a carry, we're done!
        add $8, %ebp            # drop loop-sys from rstack
        lodsb                   # skip jump offset
        jmp next

# Store a cell.
        defasm bang, "!"
        pop %ebx
        pop (%ebx)              # I'm amazed this is legal
        jmp next

# Fetch a cell.
        defasm at, "@"
        pop %ebx
        push (%ebx)             # I'm amazed this is legal too
        jmp next

# Store a byte.
        defasm c_bang, "c!"
        pop %ebx
        pop %eax
        mov %al, (%ebx)         # push and pop don't do bytes
        jmp next

# Fetch a byte.
        defasm c_at, "c@"
        pop %ebx
        xor %eax, %eax
        mov (%ebx), %al
        jmp pusheax
        
# Get the return stack pointer.
        defasm rp_at, "rp@"
        push %ebp
        jmp next
        
# Set the return stack pointer.
        defasm rp_bang, "rp!"
        pop %ebp
        jmp next
        
# Pop the return stack to the data stack
        defasm rpop, "r>"
        xchg %esp, %ebp
        pop %eax
        xchg %esp, %ebp
        jmp pusheax
        
# Push the return stack from the data stack
        defasm rpush, ">r"
        lea -4(%ebp), %ebp
        pop (%ebp)
        jmp next

# Get the data stack pointer (before it gets pushed).
        defasm sp_at, "sp@"
        push %esp               # safe on 286 and later
        jmp next

# Set the data stack pointer.
        defasm sp_bang, "sp!"
        pop %esp
        jmp next
        
# Pop the stack.
        defasm drop, "pop"
        pop %eax
        jmp next
        
# Push a copy of TOS.
# eForth 1.0 used BX to index the stack here, for a couple of
# reasons: on the 8086, SP got decremented prior to the fetch,
# and also wasn't valid as a base or index register.
        defasm dup, "dup"
        pop %eax
        push %eax
        jmp pusheax
        
# Stack manipulation ( w1 w2 -- w1 w2 w1 )
# technically not necessary, but it's so easy and tiny
        defasm over, "over"
        push 4(%esp)           
        jmp next

# Swap top two stack items ("exch" in PostScript)
        defasm swap, "swap"
        pop %edx
        pop %eax
#       jmp pushedxeax       fall through because pushedxeax is next

# pusheax and pushedxeax: a prologue to 'next' that first pushes %edx
# and %eax, or just %eax.

# For a net savings of 13 bytes, last I checked, in all those
# primitives that finish up by pushing something!  Clever trick from
# F-83's 1PUSH and 2PUSH.
pushedxeax:     
        push %edx
pusheax:        
        push %eax
        # now we fall through to 'next'
                
# "next" fetches the next bytecode and runs it.  It's placed
# here in the middle of the bytecode definitions so that more
# of them can use the short two-byte jump form to get to it.

next:
        xor %eax, %eax          # set %eax to 0
        xor %ebx, %ebx          # clear high half of %ebx
        lodsb                   # load %al from where %esi points
                                # (%esi is the interpreter pointer)
execute_eax:
        ## load offset of new word into %ebx
        mov token_table(,%eax,2), %bx  # bx := token_table[eax * 2bytes]
        cmp $last_asm_bytecode, %eax
        jbe next_primitive      # if primitive, handle primitive word
        ## otherwise, handle a bytecode definition or "colon list"
        # save old %esi on return stack
        xchg %ebp, %esp
        push %esi
        xchg %ebp, %esp
        lea bytecode_start(%ebx), %esi
        jmp next

next_primitive:
        lea machine_code_primitives(%ebx), %ebx
        jmp *%ebx
        

# Push true if n negative. ( n -- f )
        defasm negative, "0<"
        pop %eax                
        cdq
        push %edx
        jmp next

# Bitwise operators:
        defasm and, "&"
        pop %eax
        pop %ebx
        and %ebx, %eax
        jmp pusheax

        defasm or, "|"
        pop %eax
        pop %ebx
        or %ebx, %eax
        jmp pusheax

        defasm xor, "^"
        pop %eax
        pop %ebx
        xor %ebx, %eax
        jmp pusheax
        
# add two unsigned numbers, returning sum and carry.
# ( u1 u2 -- u3 cy )
        defasm umplus, "um+"
        xor %eax, %eax
        pop %edx
        pop %ebx
        add %ebx, %edx
        rcl $1, %eax
        jmp pushedxeax

# Divide double-precision by single-precision, unsigned (?).
# UM/MOD from eForth.  ( udl udh un -- ur uq )
        defasm divmod, "/%"
        pop %ebx
        pop %edx
        pop %eax
        idiv %ebx
        jmp pushedxeax

# Multiply two single-precision numbers, giving a double-
# precision result. ( d1 d2 -- udl udh )
        defasm mmul, "um*"
        pop %eax
        pop %ebx
        imul %ebx
        push %eax
        push %edx
        jmp next

# Copy the top of the return stack onto the data stack.
        defasm r_at, "r@"
        push (%ebp)
        jmp next

# syscall5:   
# Linux system call with up to 5 arguments
# This is no longer the fashionable way to make system calls
# in Linux.  Now you're supposed to use SYSENTER on newer
# CPUs, and rather than have you figure out which one to use,
# the kernel mmaps a chunk of code called a VDSO into your
# memory space at a random address and tells you where to
# find it using the ELF auxiliary vector.  Then you're
# supposed to invoke the dynamic linker or something to parse
# the ELF executable mysteriously manifested in this way by
# the kernel, and then resolve an undefined symbol in libc
# into calls to it.  See "What is linux-gate.so.1?"
# http://www.trilithium.com/johan/2005/08/linux-gate/
# "The Linux kernel: System Calls" by Andries Brouwer, 2003-02-01
# http://www.win.tue.nl/%7Eaeb/linux/lk/lk-4.html
# "About ELF Auxiliary Vectors" by Manu Garg
# http://manugarg.googlepages.com/aboutelfauxiliaryvectors

# But the old int $0x80 approach still works, thank goodness,
# because all of that is *way* more than these ten
# instructions.
        defasm syscall5, "syscall5"
        pop %edi
        ## we have to save %esi for the interpreter
        mov %esi, -4(%ebp)
        pop %esi
        pop %edx
        pop %ecx
        pop %ebx
        pop %eax
        int $0x80
        mov -4(%ebp), %esi
        jmp pusheax

        last_asm_bytecode = b_syscall5

### Basic Interpreted Words
        ## a macro for defining interpreted words
        ## Because after I left off b_exit once, I wasted a long
        ## time trying to figure out what was wrong, so I use this when I can:
        .macro def name, realname, bytes:vararg
        defbytes \name, "\realname"
        .byte \bytes
        .byte b_exit
        .endm
        ## Macros for conditional branch and loop:
        ## Because I am tired of tracking down bugs due to 
        ## getting the jump offsets wrong.
        .macro fif, target      # if, or end of while loop
        .byte b_branch_on_0, \target - . - 1
        .endm
        .macro floop, target    # do loop
        .byte b__loop, \target - . - 1
        .endm
        .macro felse, target    # else, unconditional jump
        .byte b_branch, \target - . - 1
        .endm

        .data 2                 # separate subsection from token table
bytecode_start:
# System call with three arguments.
        def syscall3, "syscall3", b_zero, b_zero, b_syscall5
# System call with one argument.
        def syscall1, "syscall1", b_zero, b_zero, b_syscall3
        def bye, "bye", b_dolit_s8,__NR_exit, b_zero, b_syscall1 # exit program
        def zero, "0", b_dolit_s8,0            # push 0
        def one, "1", b_dolit_s8,1
        
# This word outputs a string whose address and count are on 
# the stack.  ( b u -- )

        defbytes type, "type"
        .byte b_rpush, b_rpush  # move two args onto rstack
                                # system call is __NR_write:    
        .byte b_dolit_s8,__NR_write
        .byte b_one             # push constant 1: stdout
        .byte b_rpop, b_rpop    # move two args back from rstack
        .byte b_syscall3        # call syscall with 3 args
        .byte b_drop            # discard return value
        .byte b_exit            # return

# The next few words exist just to poke string addresses
# and lengths onto the stack so "type" can print them.
        .macro def_counted_string name, contents
        defbytes \name, "\name"
        .byte b_dolit_32        # dolit_32 pushes a 32-bit
        .int string_\name       # literal --- an addr, here
                                # now push literal length and return
        .byte b_count, b_exit
        .pushsection .rodata    # define the actual string:
string_\name:
        countedstring "\contents"
        .popsection
        .endm

        def_counted_string hello, "hello"
        def_counted_string world, "world"
        def_counted_string comma, ", "

        # convert a counted string in memory to an address and
        # count on the stack
        def count, "count", b_dup, b_add1, b_swap, b_c_at

        def cr, "cr", b_dolit_s8, '\n, b_emit

### Some More Basic Words

        def neg1, "-1", b_dolit_s8, -1   # ( -- -1 )
        def add, "+", b_umplus, b_drop  # ( a b -- a+b ) drop the carry
        def sub1, "1-", b_neg1, b_add    # ( n -- n-1 )
        def rot, "rot", b_rpush, b_swap, b_rpop, b_swap # ( a b c -- b c a )
        def unrot, "-rot", b_rot, b_rot    # ( a b c -- c a b )
        def tuck, "tuck", b_dup, b_unrot    # ( a b -- b a b )

# emit: output a single byte.  eForth calls this "TX!".

# This version is 11 bytes, including the buffer byte, plus the 2-byte
# token table pointer. a machine-code version I wrote the other day
# was 28 bytes.  However, I also added rot, unrot, and tuck to support
# this function, and they total 11 bytes, plus 6 bytes of overhead.
# For a total of 11+2+11+6 = 30 bytes.  Not winning yet on size over
# x86 asm!  But we're getting close.

emit_buffer:
        .byte 0
        defbytes emit, "emit"
        .byte b_dolit_32
        .int emit_buffer
        .byte b_tuck            # save a copy of address for b_type
        .byte b_c_bang          # store into emit buffer
        .byte b_one, b_type, b_exit # output one-byte buffer

### "u." prints out an unsigned number.
# I had a version of this in x86 machine code in 52 bytes (23
# instructions), essentially exactly the same code as here.
# This is 31 bytes, plus 6 bytes of overhead, plus I had 
# to define b_divmod (9 bytes plus 2 bytes overhead).  Now we are
# starting to win!

        defbytes udot, "u."            # print space after number
        .byte b_udot_nospc, b_dolit_s8, 0x20, b_emit, b_exit 
        defbytes udot_nospc, "(u.)"    # print number without space
        .byte b_dup
        fif 1f
        .byte b_udot_nonzero, b_exit
1:      .byte b_drop, b_dolit_s8, '0, b_emit, b_exit
        defbytes udot_nonzero, "((u.))"
        .byte b_zero, b_dolit_s8,10, b_divmod # divide by 10
        .byte b_dup
        fif 2f                  # recurse if nonzero
        .byte b_udot_nonzero
        felse 3f
2:      .byte b_drop            # drop zero quotient
3:      .byte b_dolit_s8, '0, b_add, b_emit # print digit
        .byte b_exit
        
### Add signed numeric output, ".".  This cost 20 bytes plus 8 bytes
# of overhead, but added some fundamental numeric operations; only 12
# of those 28 bytes are specific to "."

        # logical bitwise not
        def invert, "~", b_dolit_s8, -1, b_xor
        def add1, "1+", b_one, b_add
        # arithmetic negation
        def negate, "negate", b_invert, b_add1
        # print signed number
        defbytes dot, "."
        .byte b_dup, b_negative
        fif 1f
        .byte b_dolit_s8, '-, b_emit, b_negate # in the negative case
1:      .byte b_udot, b_exit

### Obviously the next thing to do is to add ".S", print the
# stack, so that I can stop having to investigate problems by
# using gdb.

# The bytecode for this consumed 78 bytes in 12 words, plus a
# new 8-byte primitive (mmul) and a new 14-byte primitive
# (_loop), plus six bytes in the initialization routine, for 28
# bytes of overhead and a total of 78+28+14+8+6 = 134 bytes.
# This is definitely not a size win over machine code!  Machine
# code would only be 22 bytes in 7 instructions, if there were a
# way to just CALL the "." routine from machine code, which
# there isn't.
        
# However, the words added were  cells * - / cellsize nip (do)
# (loop) 2dup  which are all generally useful, and  
# depth pick .s  which are more special-purpose.
# The special-purpose words are 35 out of those 78 bytes.  

# PRINTSTACK itself is only 15 bytes, and there's hope that the
# 6 bytes of PICK and the 14 bytes of DEPTH will be useful in
# other debugging routines.

# I'm not happy with (do) and (loop), only because (do)
# implements dpANS DO, not dpANS ?DO, so it loops many times
# when it should loop zero times;

        def_counted_string scolon, "s: "
        defbytes printstack, ".s"
        .byte b_scolon, b_type
        .byte b_depth, b_zero   # loop limits
        .byte b_twodup, b_xor
        fif 1f                  # skip loop if stack empty
        .byte b__do
2:      .byte b_i, b_pick, b_dot # DO I PICK . LOOP
        floop 2b
1:      .byte b_cr, b_exit
        def pick, "pick", b_add1, b_cells, b_sp_at, b_add, b_at
        def cells, "cells", b_cellsize, b_mul
        def mul, "*", b_mmul, b_drop # drop upper 32 bits of multiplication 
result
bottom_of_stack:
        .int 0
        defbytes depth, "depth"
        .byte b_sp_at, b_dolit_32
        .int bottom_of_stack
        .byte b_at, b_swap, b_sub, b_zero, b_cellsize, b_div, b_exit

        def sub, "-", b_negate, b_add # subtract ( a b -- a-b )
        def div, "/", b_divmod, b_nip # int divide ( ul uh n -- quotient )
        def cellsize, "cellsize", b_dolit_s8,4   # ( -- 4 )
        def nip, "nip", b_swap, b_drop  # stack manipulation ( a b -- b )

        # 10 0 DO ... LOOP loops 0, 1...9.
        # _do sets up return stack for _loop
        # similar to F83: ( limit initial -- ) ( R: X -- X initial-limit limit )
        defbytes _do, "(do)"
        .byte b_over, b_sub,  b_swap,  b_rpop,  b_swap, b_rpush
        .byte b_swap, b_rpush,  b_rpush,  b_exit
        ##  return loop counter
        def i, "i", b_rpop, b_rpop, b_r_at, b_over, b_rpush, b_add, b_swap, 
b_rpush

        def twodup, "2dup", b_over, b_over
        def twodrop, "2drop", b_drop, b_drop

# Now some stuff for dealing with the dictionary.

# This stuff was from 804930a to 8049396, 140 bytes.  In that we got:
# - new words: dict dictp dictsize nextword pastdict? words < >= 0=
#   cbcmp [EMAIL PROTECTED] bcmp memcmp unloop find r2@ 2swap
# - less concretely:
#   - the ability to list of words in the dictionary;
#   - the ability to find words in the dictionary;
#   - <, >=, and 0= numerical comparisons;
#   - cbcmp, bcmp, and memcmp memory manipulations;
#   - 2swap and r2@ stack manipulations;
#   - unloop loop control;
#   - [EMAIL PROTECTED] for iterating over memory.
# That's 17 new words, averaging 8.2 bytecodes each.

dictionary_pointer:             
        .int end_of_dictionary
        defbytes dict, "dict"
        .byte b_dolit_32
        .int dictionary
        .byte b_exit
        defbytes dictp, "dictp"
        .byte b_dolit_32
        .int dictionary_pointer
        .byte b_exit
        def dictsize, "dictsize", b_dictp, b_at, b_dict, b_sub
        def nextword, "nextword", b_dup, b_c_at, b_add, b_add1
        def pastdict, "pastdict?", b_dictp, b_at, b_ge
        defbytes words, "words"
        .byte b_dict
1:      .byte b_dup, b_count, b_type, b_dolit_s8,32, b_emit
        .byte b_nextword
        .byte b_dup, b_pastdict
        fif 1b
        .byte b_exit
        def lt, "<", b_sub, b_negative
        def ge, ">=", b_lt, b_zeq
        # logical not: return true for 0, false (0) otherwise
        defbytes zeq, "0="
        fif 1f
        .byte b_zero, b_exit
1:      .byte b_neg1, b_exit

        # To find a word in the dictionary:
        # - move the word onto the return stack, and get the dictionary pointer
        # - then loop:
        #   - see if the word is at the current place
        #   - if so, clean up and return that place
        #   - otherwise, go to the next word
        #   - and repeat if we're still in the dictionary
        # - then clean up the stacks and return 0
        # Tells whether a counted string equals an address-and-count string.
        # 0 for equal, nonzero for unequal.
        # ( c-addr1 c-addr2 u -- n )
        def cbcmp, "cbcmp", b_rot, b_count, b_twoswap, b_bcmp
        # like F21 @A+: ( c-addr -- c-addr+1 char )
        def c_at_inc, "[EMAIL PROTECTED]", b_dup, b_add1, b_swap, b_c_at
        # Keep in mind memcmp() in libc is only 30 bytes long.
        # This bcmp is a little different from C memcmp or bcmp in
        # that it compares two lengths.
        defbytes bcmp, "bcmp" # ( c-addr1 u1 c-addr2 u2 -- n )
        .byte b_rot, b_over, b_xor
        fif 3f
        .byte b_twodrop, b_drop, b_one, b_exit
3:      .byte b_memcmp, b_exit
        defbytes memcmp, "memcmp"  # ( c-addr1 c-addr2 u -- n )
        .byte b_zero, b__do
2:      .byte b_c_at_inc, b_rot, b_c_at_inc, b_rot
        .byte b_sub, b_dup # - dup if
        fif 1f
        .byte b_unrot, b_twodrop, b_unloop, b_exit
1:      .byte b_drop, b_swap
        floop 2b
        .byte b_twodrop, b_zero, b_exit
        # this should probably go with the other do loop stuff
        def unloop, "unloop", b_rpop, b_rpop, b_rpop, b_twodrop, b_rpush

        # FIND: ( c-addr u -- token 1 | 0 )
        
        # 30 bytes; jonesforth 42's asm version is 56 bytes.  It's
        # fairly directly comparable, although jonesforth's FIND has
        # to do bit-masking and includes its own inline NEXTWORD and
        # CBCMP, which are actually fairly large here.
        
        defbytes find, "find"
        .byte b_rpush, b_rpush, b_zero, b_dict
1:      .byte b_dup, b_r_2at, b_cbcmp, b_zeq            # start loop
        fif 2f                  # bcmp 0= if
        .byte b_rpop, b_rpop, b_twodrop, b_drop, b_one, b_exit
2:      .byte b_swap, b_add1, b_swap, b_nextword, b_dup, b_pastdict
        fif 1b
        .byte b_rpop, b_rpop, b_twodrop, b_twodrop, b_zero, b_exit

        # copy two cells from return stack
        defbytes r_2at, "r2@"
        .byte b_rpop, b_rpop, b_r_at, b_over, b_rpush, b_rot, b_rpush, b_exit
        # ( a b c d -- c d a b )
        def twoswap, "2swap", b_rpush, b_unrot, b_rpop, b_unrot

        .macro create, name
        defbytes \name, "\name"
        .byte b_dolit_32
        .int 1f
        .byte b_exit
1:      
        .endm
        
        create "tib"
        _tibmax = 80
        .space _tibmax
        defbytes tibmax, "tibmax"
        .byte b_dolit_32
        .int _tibmax
        .byte b_exit
        create "tibsize"
        .int 0

        defbytes fgets, "fgets"
        .byte b_tib, b_tibmax, b_read, b_tibsize, b_bang # XXX handle errors
        .byte b_tib, b_tibsize, b_at, b_exit
        def gets, "gets", b_zero, b_fgets

        defbytes read, "read"
        .byte b_rpush, b_rpush, b_dolit_s8,__NR_read
        .byte b_zero             # fd 0: stdin
        .byte b_rpop, b_rpop, b_syscall3, b_exit

        def bl, "bl", b_dolit_s8,32 # space, blank

        # parse parses out a token of input from a string and leaves
        # the token's address and length atop the stack
        # ( c-addr u -- c-addr+n u-n c-addr2 u2 )
        defbytes parse, "parse"
        .byte b_skipwhitespace
        ##  XXX finish him!
        .byte b_exit

        defbytes skipwhitespace, "-wsp"
1:      .byte b_dup, b_zeq
        fif 3f                  # return empty tail
        .byte b_exit
3:      .byte b_sub1, b_swap, b_c_at_inc, b_whitespace
        fif 2f                  # escape loop if not whitespace
        .byte b_swap
        felse 1b
2:      .byte b_sub1, b_swap, b_add1, b_exit

        ## costs 9+5 bytes
        def_counted_string wsps, " \n\t"
        defbytes whitespace, "wsp"
        .byte b_dup, b_bl, b_xor
        fif 1f
        .byte b_dup, b_dolit_s8,'\n, b_xor
        fif 1f
        .byte b_dup, b_dolit_s8,'\t, b_xor
        fif 1f
        .byte b_drop, b_zero, b_exit
1:      .byte b_drop, b_one, b_exit

#         def repl, "repl", b_gets, b_eval, b_exit

#         defbytes eval, "eval"
# 2:      .byte b_parse
#         .byte b_dup, b_zeq
#         fif 3f                  # escape from loop
#         .byte b_rot, b_rpush, b_rot, b_rpush, b_find
#         fif 1f
#         .byte b_execute
# 1:      .byte b_rpop, b_rpop
#         felse 2b
# 3:      .byte b_2drop, b_2drop, b_exit

        .data 3
instructions:
        # And here is the actual "main program" in that bytecode.
        .byte b_sp_at, b_dolit_32
        .int bottom_of_stack    # variable to remember initial stack bottom
        .byte b_bang            # initialize that variable
        .byte b_hello           # string "hello" and count
        .byte b_sub1            # subtract 1 from count: "hell"
        .byte b_type            # spit it out
        .byte b_comma, b_type, b_world, b_type # ", world"
        .byte b_comma, b_type, b_hello, b_type, b_cr
        # test the "dot" command to print out numbers
        .byte b_dolit_s8, -120, b_dot
        # test positive numbers and "depth" command
        .byte b_dolit_s8, 104, b_depth, b_dot, b_dot, b_cr
        # test printstack
        .byte b_dolit_s8,100, b_dolit_s8,101, b_dolit_s8,102, b_printstack
        # test dictsize
        .byte b_dictsize, b_dot, b_cr
        .byte b_words, b_cr
        .byte b_hello, b_twodup, b_type, b_find, b_printstack
        .byte b_comma, b_twodup, b_type, b_find, b_printstack
        .byte b_dict, b_dot, b_cr
        .byte b_dolit_s8, '?, b_emit, b_bl, b_emit, b_gets, b_twodup
        .byte b_zero, b__do
1:      .byte b_c_at_inc, b_whitespace, b_dot
        floop 1b
        .byte b_drop
        .byte b_skipwhitespace, b_type
        .byte b_bye

        # At end of the assembly program, we initialize the
        # end_of_dictionary pointer by putting it at the end of the
        # assembled .dictionary section:
        .section .dictionary
end_of_dictionary:      

Reply via email to