# I'm almost to the point of being able to parse "words separated by spaces".

# Simple token-threaded "Forth" interpreter.            -*- asm -*-
# Version 3.
# by Kragen Javier Sitaker; dedicated to the public domain,
# i.e. I relinquish whatever exclusive rights copyright law
# gives me with regard to this work.
# Major parts taken from Richard W.M. Jones's public-domain
# JONESFORTH 42 by Richard W.M. Jones <[EMAIL PROTECTED]>
# http://annexia.org/forth

# As of version 2, this program just outputs the following under
# Linux:
# hell, world, hello
# -120 1 104
# s: 102 101 100
# 397
# lit8 lit ret execute (if) (else) (loop) ! @ c! c@ rp@ rp! r> >r sp@ sp! pop dup over swap 0< & | ^ um+ /% um* r@ syscall5 syscall3 syscall1 bye 0 1 type hello world comma count cr -1 + 1- rot -rot tuck emit u. (u.) ((u.)) ~ 1+ negate . scolon .s pick cells * depth - / cellsize nip (do) i 2dup 2drop dict dictp dictsize nextword pastdict? words < >= 0= cbcmp c@+ bcmp memcmp unloop find r2@ 2swap
# hello134517863
# , 0
# 134517710

# to compile:
# gcc -m32 -nostdlib -static -o tokthr2 tokthr2.S

### Why Small Things are Interesting

# There are still a lot of computers out there that have tens of
# kilobytes of memory or less, and they cost a lot less than,
# say, a cellphone. Cellphones are apparently still too
# expensive for half the world's population. I want to see how
# close I can get to having a comfortable programming
# environment in a smaller device.

# Some smallish microcontroller chips from five different
# manufacturers, with current Digi-Key prices:

# Name               bytes RAM  bytes ROM  MHz  price
# ATtiny2313               128       2048   20  US$1.38
# ATMega48-20AU            512       4096   20  US$1.62
# MSP430F1111AIPW          128       2264    8  US$2.43
# LPC2101                 2048       8192   70  US$2.52
# H8/300H Tiny            1536       8192   12  US$3.58
# M16C/R8C/Tiny/1B        1024      16384   12  US$3.54
# SX28AC/SS                136       3072   50  US$2.79

# There are essentially no 386-compatible devices in this price
# range as far as I can tell; for why I'm not worried about
# that, see the section "How Small This Is".

# More practically and short-termly, small projects can take
# less time to finish, and I feel like I need to learn about
# different approaches to implementing programming languages.

### Why This is Small

# The normal Forth representation of a function is as an array
# of pointers to the other functions it calls, in sequence; a
# few of those other functions may move the interpreter pointer
# around in that array, or snarf up a constant that's stored in
# the array, or stuff like that, but for the most part, the
# functions just get called in sequence. This is called
# "threaded code" and it's fairly compact, especially on 16-bit
# systems where the pointers are only two bytes.

# A traditional approach taken by Forth implementations to
# reduce code size even further is called "token threading".
# Rather than making arrays of 16-bit or 32-bit pointers, they
# make lists of 8-bit indices into an array of pointers. This
# has two advantages:

# 1. the indices are one fourth the size of a list of 32-bit
#    pointers;

# 2. it is possible to save these lists of indices somewhere
#    outside of memory and continue to use them even after
#    making some changes to the code, as long as the same
#    indices in the table have the same meanings. So, for
#    example, you could write some boot firmware in this
#    "bytecode".

# It also has some disadvantages:

# 1. You run out of space in the table. Even a fairly minimal
#    full Forth system contains close to 256 subroutines. You
#    can mitigate this by packing, say, two 12-bit pointers
#    every three bytes, or maybe by having a special bytecode
#    that looks up the next byte in an extended table.

# 2. It's slower and makes the machine-code part of the program
#    take more space. The traditional LODSW; JMP AX version of
#    $NEXT from the eForth Model, which fetches and executes the
#    next execution token in the threaded list, is three bytes
#    and two instructions; my 'next' here is 41 bytes and 14
#    instructions, which is big enough that I jump to it (2
#    bytes) rather than making an assembler macro. Which blows
#    your branch target buffers to pieces. Oh well. The
#    performance penalty is probably two orders of magnitude
#    over native code, but I haven't measured it yet. I
#    measured an earlier version on my 700MHz PIII laptop on an
#    empty loop at about a factor of 3.5 over simple
#    direct-threading, which in turn is on the order of 10 times
#    slower than machine code.

# Anyway, so this is an example program built using this
# technique. It implements two Forthlike stacks and interpreted
# subroutines, but not yet the ability to define new subroutines
# at run-time.

### What's Here

# I've implemented all of the primitives from C. H. Ting and
# Bill Muench's public-domain (?) eForth Model 1.0, except for
# the following:
# - I haven't implemented their lowercase "next" (as in FOR
#   NEXT) because I think it's a bad idea, it's complex, and it
#   can be implemented at a higher level if you really need it;
#   instead, I implemented (loop).
# - I didn't implement !IO because it's a no-op in this context;
# - I haven't yet implemented ?RX, although I think it's
#   possible to implement it on top of syscall5, using select();
# Additionally, I implemented multiply and divide primitives.

# However, some of it is untested and therefore probably broken.
# Procedure call and return and the system calls do work.

# Currently registers are used as follows:
# %esi --- interpreter pointer; points to next byte to execute
# %ebp --- return stack pointer; points to last thing pushed. This stack,
#          like the other one, grows downwards.
# %esp --- data stack pointer; points to last thing pushed. This is
#          the processor's standard stack pointer; "push" and "pop"
#          instructions use it, which makes assembly code that uses it
#          quite concise. The Intel "call" and "ret" instructions
#          would also use this stack, but they aren't used in this
#          program.
# flags --- the "down" direction flag must be cleared.

# It's probably missing a couple of primitives needed because of
# the token-threading implementation strategy; the address of
# the token table probably needs to be knowable, at least.

# Direct and indirect threading, the normal Forth approaches to
# allowing unrestricted coexistence of words written in assembly
# language and interpreted Forth, both had heavy space costs
# here --- close to 100% for the bytecode currently in the
# system. So the inner interpreter checks, for each bytecode,
# whether it is in the range of bytecodes whose interpretations
# are in native code, and picks the relevant code path. This
# avoids consuming any space per-word for this distinction, but
# at what appears to be a heavy performance cost.
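# To make that dispatch concrete, here is a minimal sketch of the
# same idea in C. It is illustrative only: the names are made up,
# the data stack is byte-wide for brevity, and the token-table
# entries for the primitive tokens go unused (in the real program
# they hold offsets into the machine code, and the dispatch is a
# compare against last_asm_bytecode and an indirect jump, not a
# switch).
#
#   #include <stdio.h>
#   #include <stdint.h>
#
#   enum { T_LIT8, T_EMIT, T_RET, T_BYE, LAST_PRIM = T_BYE };
#
#   static uint8_t code[] = {
#       /* offset 0: a bytecode word that emits '>' and returns */
#       T_LIT8, '>', T_EMIT, T_RET,
#       /* offset 4: the "main program"; token 4 is the word above */
#       4, 4, T_BYE,
#   };
#   static uint16_t token_table[] = { 0, 0, 0, 0, /* token 4: */ 0 };
#
#   static uint8_t *ip;                                     /* like %esi */
#   static uint8_t *ret_stack[32], **rp = ret_stack + 32;   /* like %ebp */
#   static uint8_t data_stack[32], *sp = data_stack + 32;
#
#   int main(void)
#   {
#       ip = code + 4;                       /* start at the main program */
#       for (;;) {
#           uint8_t t = *ip++;               /* fetch the next token */
#           if (t > LAST_PRIM) {             /* bytecode word: nest into it */
#               *--rp = ip;                  /* save ip on the return stack */
#               ip = code + token_table[t];
#               continue;
#           }
#           switch (t) {                     /* native primitive */
#           case T_LIT8: *--sp = *ip++;  break;  /* literal from the stream */
#           case T_EMIT: putchar(*sp++); break;
#           case T_RET:  ip = *rp++;     break;
#           case T_BYE:  putchar('\n');  return 0;
#           }
#       }
#   }
#
# Compiled and run, this prints ">>". The 'next' routine below does
# the same range check and the same return-stack nesting for
# bytecode words.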
# This is similar to an approach called "bit threading"; as # explained by Jeff Fox in a comp.lang.forth message # 2007-05-13, on thread "Cases where Forth seems a little # clunky": # I have seen hardware and software implemenations of # bit-threading where the msb of the address space # selects between threaded code address lists and # addresses of CODE subroutines. In both cases 0 is a # valid address and negative addresses are valid. I think # this applied to Novix. # Except that this approach uses a fraction of a bit for most # tokens instead of a whole bit. ### How Small This Is # As a point of comparison, eForth 1.0's machine-code part seems # to be 171 instructions and 399 bytes, including some data # that's mixed in there with it. # As of version 3, the machine-code part of this interpreter is # 239 bytes in 129 machine-code instructions, plus 19 bytes of # read-only data, 172 bytes of token table for the 86 # currently-defined words (30 of which are primitive), 397 # bytes of their names, and another 370 bytes of bytecode in # the 56 bytecode words (including a couple of data values # scattered around), and then another 56 bytes in the main # program, for a total of 1253 bytes. So the program is # already less than half written in assembly, in terms of # object-code size. # As of version 3, the machine-code part of this program uses # only 25 different instructions: cld; jmp, jz, jnc, jbe; push, # pop, lodsb, lodsl, xchg, mov; cmp, movsbl, inc, and, xor, or, # lea, cdq, rcl, add, idiv, imul; and int. Interestingly, many # of them are only used once. # Just before version 3, the non-comment lines were 14% # assembler macros, 51% assembly language, and 35% bytecode. # From Brad Rodriguez's January/February 1993 Computer Journal # article, "Moving Forth: Part 1: Design Decisions in the Forth # Kernel": # You can expect the usual 16-bit Forth kernel (see below) # to occupy about 8K bytes of program space. For a full # kernel that can compile Forth definitions, you should # allow a minimum of 1K byte of RAM. # I'm pretty sure I can beat the 8K requirement by quite a bit # and still be able to compile Forth definitions --- I'm hoping # for a factor of 4. Consider, from Jeff Fox's "Thoughtful # Programming", chapter 3: http://www.ultratechnology.com/forth3.htm # People assume that since Chuck has refined his Forth down to # about a 1K object that this means he has just stripped his # Forth down to a 1K kernel that will boot like in the old days # and that he is going to compile a complete Forth system on top # of the 1K before he starts an application. This is wrong. The # complete Forth system is 1K, and the reason for that is # maximize Chuck's productivity. What stops people from doing # what they need to do to solve a problem is all the time spend # solving all the related sub-problems that pop up as a result # of complex interconnections between components. To maximize # his productivity Chuck minimizes the number of these side # problems that pop up. Keep it simple, and don't get to where # you are spending 90% or 99% or you time dealing with related # sub-problems. Avoid unsolvable problems, don't waste your time # trying to solve them. # Consider also this quote from Elizabeth Rather: # As you have seen, so much depends on the specific # machine architecture. We implemented a TTC [token # threaded] Forth on some low-end AVR processors with # very limited code space, and it ran faster than a # native-code 8051 at comparable clock speed. 
# (2007-06-23 comp.lang.forth post on thread "Build your own
# Forth for Microchip PIC (Episode 838): Threading")
# http://objectmix.com/forth/168105-build-your-own-forth-microchip-pic-episode-838-threading.html

# Consider also this quote from the abstract of Frank
# Sergeant's 1991 "A 3-Instruction Forth for Embedded Systems
# Work: Illustrated on the Motorola MC68HC11":

#     A 3-instruction Forth makes Forth affordable for
#     target systems with very limited memory. It can be
#     brought up quickly on strange new hardware. You don't
#     have to do without Forth because of memory or time
#     limitations. It only takes 66 bytes for the Motorola
#     MC68HC11. Full source is provided. . . . The absolute
#     minimum the target must do, it seems to me, is fetch a
#     byte, store a byte, and call a subroutine.

# http://pygmy.utoh.org/3ins4th.html

# As I said before, there are no small, cheap 386s, and so of
# course the code size of this version is an approximation, and
# it won't be a simple recompile to port it to one of these
# other architectures; but 129 instructions' worth of assembly
# are probably not that big a deal to rewrite for a new
# platform. (I'll probably want to write multiply and divide
# routines, though.)

### Other Things I Tried

# I tried switching to caching the top of the data stack in a
# register, on the theory that it would shorten things like
# 'and'. Currently 'and' is pop %eax; pop %ebx; and %ebx,
# %eax; push %eax; jmp next. If top of stack is cached in %eax
# instead of being stored in memory, this becomes pop %ebx; and
# %ebx, %eax; jmp next, which is considerably shorter.
# However, most things don't change, and other things become
# longer due to the extra work to save top-of-stack. I tried
# using both %ebx and %eax as the top-of-stack cache.

# In the version using %ebx as top-of-stack, the total size of
# the machine-code part was 216 bytes, 115 instructions,
# compared to 197 bytes, 112 instructions for the version using
# the current strategy. In the version using %eax as
# top-of-stack, it was only 215 bytes, but that's still worse
# than 197.

# In previous versions, all routines were machine-code routines
# that you could just jmp to. High-level bytecode words began
# with "call dolist", which took the saved %eip off the stack
# and stuck it in %esi. Unfortunately, that added 5 bytes to
# each bytecode word; in version 2, the bytecode region is 221
# bytes and contains 36 word definitions (the 30 machine-code
# primitives aren't defined there) --- so 5 bytes each would
# have been 180 bytes of overhead, or 82%! It also would have
# required a region to be both executable and writable to
# support run-time routine definition, which is kind of a pain
# these days, and also on Harvard-architecture microcontrollers.

# In previous versions, the token-table entries were 32 bits
# each (instead of 16 bits as they are now), which added
# another 2 bytes of overhead per word. In version 2, there
# are currently 66 words, so that's another 132 bytes of shaved
# overhead.

### Performance (Speed)

# To version 2, I added a simple program to print out a
# badly-formatted 8-bit extended ASCII table; it was 11
# bytecode operations long. It executed 81615 bytecodes in all
# (according to gdb). On my 700MHz PIII-Coppermine laptop from
# 1999, 'time' reports CPU times varying from 4-12
# milliseconds. So it seems like it can execute about 10 000
# bytecodes in a millisecond, or about 10 million bytecodes in
# a second --- about 100ns per bytecode.

# It also made 1459 system calls.
# That's slightly faster than Python's bytecode engine (maybe # --- probably within measurement error --- and anyway Python's # bytecodes are larger-grained), and an order of magnitude # slower than simple direct-threading, and about three times # slower than direct-threading with a simple bytecode # indirection layer. # I would be surprised if version 3 had any measurable change # in speed from version 2. # To version 3, I added a little loop to repeatedly search the # 86-word wordlist for ", ", which isn't there. I was able to # do this search 10 000 times in 4.59 CPU seconds, which is # 5.34 microseconds per comparison, or about 3700 CPU cycles. # If we eventually have 300 words in the wordlist, each search # will therefore take 1.6 milliseconds in the worst case, so # the system won't be able to compile or interpret more than # about 50-100 lines of code per second. It should probably be # pretty easy to fix this particular performance problem # (recode wordlist search in asm, restructure, or whatever) if # it comes up, but it's a bit worrisome... ### What's Wrong With This Program # - It's a long way from doing anything useful. # - There's 21 instructions of unused code which may be broken. # In Version 3, these words are in use and so probably work: # hello, sub1, type, comma, world, newline, dolit_s8, dot, # bye, exit, branch_on_0, c_bang, drop, dup, swap, negative, # umplus, divmod, syscall5, syscall3, zero, syscall1, rpop, # rpush, one, dolit_32, neg1, add, emit, tuck, udot, # udot_nospc, udot_nonzero, branch, invert, add1, negate, # xor, printstack, cells, mul, depth, div, sub, cellsize, # nip, pick, twodup, i, _do, r_at, over, at, sp_at, bang, # twoswap, r_2at, find, unloop, memcmp, bcmp, c_at_inc, # cbcmp, zeq, ge, lt, words, pastdict, nextword, dictsize, # dictp, dict, c_at. These are not tested, and therefore may be # broken: execute, rp_at, rp_bang, sp_bang, and, or. # - There's no dictionary structure yet. # - It probably needs another couple of primitives. # - There's no checking for stack overflow or underflow, but # they will break things. # - It's slow; see above about performance. ### The Beginning of the Program # .. include system library header so we know __NR_exit = 1 and # __NR_write = 4 #include <asm/unistd.h> ### The token table and dictionary # To save space, we're trying to avoid storing pointers # as much as possible. So most of the code is # represented as "tokens", which are offsets into the # "token table", which contains 16-bit offsets into # either the machine-code primitives or the data # segment. # The "dictionary" is stored in this program as a # sequence of strings, stuffed next to each other with # no intervening pointers at all. The idea is that if # you want to execute or compile a word, you walk # through the dictionary, examining each string in turn # to see if it's the right word. If so, then the # number of strings you rejected is the index into the # token table. # This implies that the dictionary needs to be stored # somewhere where it can expand without overwriting # other stuff. So I'm putting it in its own section # for the time being. I'm pretty sure that means it'll # get at least a page, which is enough space to define # at least several hundred words. Maybe at some future # time I'll copy it into .bss at boot time instead. 
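# A sketch of that dictionary lookup in C, for concreteness
# (find_token and its arguments are hypothetical names; the
# program's own 'find', defined later in bytecode, does the same
# walk):
#
#   #include <stdint.h>
#   #include <string.h>
#
#   /* dict is packed counted strings: [len][chars...][len][chars...]
#      The index of the matching string is the word's token.
#      Returns -1 if the name isn't in the dictionary. */
#   static int find_token(const uint8_t *dict, const uint8_t *dict_end,
#                         const char *name, unsigned len)
#   {
#       int token = 0;
#       while (dict < dict_end) {
#           if (dict[0] == len && memcmp(dict + 1, name, len) == 0)
#               return token;
#           dict += 1 + dict[0];     /* skip count byte plus the string */
#           token++;
#       }
#       return -1;
#   }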
.data 1 # Start putting stuff in data subsection 1 .align 4 ## table to define the "bytecode" instructions token_table: # "a" means "allocatable", "w" means "writable" .section .dictionary, "aw" dictionary: ## I was frustrated with the unreadability of my ## bytecode lists; I was counting token table entries ## by hand and writing bytecodes numerically. So I ## wrote a macro to help. ## Note that we are using a separate .subsection ## directive because gas 2.17 doesn't support putting ## that in the .pushsection line, even though it is ## documented to do so; see message from Maciej ## W. Rozycki on 2007-10-11, subject "Re: How to use ## .pushsection?", ## http://sourceware.org/ml/binutils/2007-10/msg00176.html ## for more details) ## The first few entries in the table of bytecodes are ## all defined in machine code; the rest are all ## defined in bytecode. The inner interpreter examines ## each bytecode to determine which category it falls ## in in order to figure out how to execute it, ## including what base address to add its offset to. ## This sucks for extensibility but rocks for ## compactness. .macro countedstring name .byte stringlength\@ 1: .ascii "\name" ## Here we count the length of the string --- computers ## are for counting bytes so people don't have to! stringlength\@ = . - 1b .endm .macro define_bytecode name, realname, origin .pushsection .data # save current position, go to data section .subsection 1 # and subsection 1, where we put the addrs b_\name = (. - token_table) / 2 # define b_foo as the index of this ptr .ifeq b_\name - 256 .error "\name got bytecode 256" .endif .short \name - \origin # insert offset which will be resolved next .popsection # return to where we were, and .pushsection .dictionary countedstring "\realname" .popsection \name: # define the name .endm .macro defasm name, realname define_bytecode \name, "\realname", machine_code_primitives .endm .macro defbytes name, realname define_bytecode \name, "\realname", bytecode_start .endm ### The Return Stack # We put Forth return addresses here, but programs can also use # it for other purposes. .bss .space 4096 initial_return_stack_pointer: ### Initialization .text # the following stuff goes in the text segment .global _start # declare _start as a global symbol # (otherwise ld won't be able to find it) _start: # this is the entry point for ELF I guess cld # clear direction flag; unnecessary? mov $initial_return_stack_pointer, %ebp mov $instructions, %esi # %esi is the interpreter pointer register jmp next # and now we start the interpreter. # (somewhat silly since we could just # fall through..) ### The Machine-Code Primitives # Also next (aka the address interpreter or inner interpreter) # is in this section. machine_code_primitives: # dolit_s8 takes a signed 8-bit literal from the instruction # stream and pushes it onto the stack. defasm dolit_s8, "lit8" lodsb movsbl %al, %eax jmp pusheax defasm dolit_32, "lit" # more general dolit lodsl jmp pusheax defasm exit, "ret" # Return from a colon defn. xchg %ebp, %esp pop %esi xchg %ebp, %esp jmp next defasm execute, "execute" # Run xt on data stack. pop %eax # Here 'xt' is the one-byte token. jmp execute_eax # Branch if top of stack is 0 (implementing IF). # Both branch instructions take a signed byte offset from the bytecode # stream. 
defasm branch_on_0, "(if)" pop %eax and %eax, %eax jz branch inc %esi # skip 1-byte jump offset jmp next defasm branch, "(else)" lodsb movsbl %al, %eax # same insn size as cbtw; cwde add %eax, %esi jmp next # (loop) is what we do at the end of a DO LOOP. # DO puts a number on top of the return stack that is zero minus # the number of iterations remaining. So when it finally # reaches zero, we're done. It also put a number underneath # that on the return stack from which you can recover the # iteration counter. # This scheme is mostly due to F-83, except that in F-83 it # reached 0x8000 instead of 0, which seemed perverse to me. # This isn't strictly necessary as part of the minimal primitive # set, but it seemed like it would make inner loops maybe ten # times faster, and as with most words that do return-stack # manipulation, the size penalty is actually negative. (This # version is 14 bytes; in bytecode it would be 17.) # ( -- ) ( R: -1 adjustment -- ) in end-of-loop case, and skip # the interpreter pointer over the jump offset. # ( -- ) ( R: counter -- counter+1 ) in normal case, and adjust # interpreter pointer by number of bytes stored after call to # this routine. defasm _loop, "(loop)" xor %eax, %eax # mov $1, %eax is 5 bytes inc %eax add %eax, (%ebp) jnc branch # if no carry, go branch # If there was a carry, we're done! add $8, %ebp # drop loop-sys from rstack lodsb # skip jump offset jmp next # Store a cell. defasm bang, "!" pop %ebx pop (%ebx) # I'm amazed this is legal jmp next # Fetch a cell. defasm at, "@" pop %ebx push (%ebx) # I'm amazed this is legal too jmp next # Store a byte. defasm c_bang, "c!" pop %ebx pop %eax mov %al, (%ebx) # push and pop don't do bytes jmp next # Fetch a byte. defasm c_at, "c@" pop %ebx xor %eax, %eax mov (%ebx), %al jmp pusheax # Get the return stack pointer. defasm rp_at, "rp@" push %ebp jmp next # Set the return stack pointer. defasm rp_bang, "rp!" pop %ebp jmp next # Pop the return stack to the data stack defasm rpop, "r>" xchg %esp, %ebp pop %eax xchg %esp, %ebp jmp pusheax # Push the return stack from the data stack defasm rpush, ">r" lea -4(%ebp), %ebp pop (%ebp) jmp next # Get the data stack pointer (before it gets pushed). defasm sp_at, "sp@" push %esp # safe on 286 and later jmp next # Set the data stack pointer. defasm sp_bang, "sp!" pop %esp jmp next # Pop the stack. defasm drop, "pop" pop %eax jmp next # Push a copy of TOS. # eForth 1.0 used BX to index the stack here, for a couple of # reasons: on the 8086, SP got decremented prior to the fetch, # and also wasn't valid as a base or index register. defasm dup, "dup" pop %eax push %eax jmp pusheax # Stack manipulation ( w1 w2 -- w1 w2 w1 ) # technically not necessary, but it's so easy and tiny defasm over, "over" push 4(%esp) jmp next # Swap top two stack items ("exch" in PostScript) defasm swap, "swap" pop %edx pop %eax # jmp pushedxeax fall through because pushedxeax is next # pusheax and pushedxeax: a prologue to 'next' that first pushes %edx # and %eax, or just %eax. # For a net savings of 13 bytes, last I checked, in all those # primitives that finish up by pushing something! Clever trick from # F-83's 1PUSH and 2PUSH. pushedxeax: push %edx pusheax: push %eax # now we fall through to 'next' # "next" fetches the next bytecode and runs it. It's placed # here in the middle of the bytecode definitions so that more # of them can use the short two-byte jump form to get to it. 
next: xor %eax, %eax # set %eax to 0 xor %ebx, %ebx # clear high half of %ebx lodsb # load %al from where %esi points # (%esi is the interpreter pointer) execute_eax: ## load offset of new word into %ebx mov token_table(,%eax,2), %bx # bx := token_table[eax * 2bytes] cmp $last_asm_bytecode, %eax jbe next_primitive # if primitive, handle primitive word ## otherwise, handle a bytecode definition or "colon list" # save old %esi on return stack xchg %ebp, %esp push %esi xchg %ebp, %esp lea bytecode_start(%ebx), %esi jmp next next_primitive: lea machine_code_primitives(%ebx), %ebx jmp *%ebx # Push true if n negative. ( n -- f ) defasm negative, "0<" pop %eax cdq push %edx jmp next # Bitwise operators: defasm and, "&" pop %eax pop %ebx and %ebx, %eax jmp pusheax defasm or, "|" pop %eax pop %ebx or %ebx, %eax jmp pusheax defasm xor, "^" pop %eax pop %ebx xor %ebx, %eax jmp pusheax # add two unsigned numbers, returning sum and carry. # ( u1 u2 -- u3 cy ) defasm umplus, "um+" xor %eax, %eax pop %edx pop %ebx add %ebx, %edx rcl $1, %eax jmp pushedxeax # Divide double-precision by single-precision, unsigned (?). # UM/MOD from eForth. ( udl udh un -- ur uq ) defasm divmod, "/%" pop %ebx pop %edx pop %eax idiv %ebx jmp pushedxeax # Multiply two single-precision numbers, giving a double- # precision result. ( d1 d2 -- udl udh ) defasm mmul, "um*" pop %eax pop %ebx imul %ebx push %eax push %edx jmp next # Copy the top of the return stack onto the data stack. defasm r_at, "r@" push (%ebp) jmp next # syscall5: # Linux system call with up to 5 arguments # This is no longer the fashionable way to make system calls # in Linux. Now you're supposed to use SYSENTER on newer # CPUs, and rather than have you figure out which one to use, # the kernel mmaps a chunk of code called a VDSO into your # memory space at a random address and tells you where to # find it using the ELF auxiliary vector. Then you're # supposed to invoke the dynamic linker or something to parse # the ELF executable mysteriously manifested in this way by # the kernel, and then resolve an undefined symbol in libc # into calls to it. See "What is linux-gate.so.1?" # http://www.trilithium.com/johan/2005/08/linux-gate/ # "The Linux kernel: System Calls" by Andries Brouwer, 2003-02-01 # http://www.win.tue.nl/%7Eaeb/linux/lk/lk-4.html # "About ELF Auxiliary Vectors" by Manu Garg # http://manugarg.googlepages.com/aboutelfauxiliaryvectors # But the old int $0x80 approach still works, thank goodness, # because all of that is *way* more than these ten # instructions. defasm syscall5, "syscall5" pop %edi ## we have to save %esi for the interpreter mov %esi, -4(%ebp) pop %esi pop %edx pop %ecx pop %ebx pop %eax int $0x80 mov -4(%ebp), %esi jmp pusheax last_asm_bytecode = b_syscall5 ### Basic Interpreted Words ## a macro for defining interpreted words ## Because after I left off b_exit once, I wasted a long ## time trying to figure out what was wrong, so I use this when I can: .macro def name, realname, bytes:vararg defbytes \name, "\realname" .byte \bytes .byte b_exit .endm ## Macros for conditional branch and loop: ## Because I am tired of tracking down bugs due to ## getting the jump offsets wrong. .macro fif, target # if, or end of while loop .byte b_branch_on_0, \target - . - 1 .endm .macro floop, target # do loop .byte b__loop, \target - . - 1 .endm .macro felse, target # else, unconditional jump .byte b_branch, \target - . - 1 .endm .data 2 # separate subsection from token table bytecode_start: # System call with three arguments. 
def syscall3, "syscall3", b_zero, b_zero, b_syscall5 # System call with one argument. def syscall1, "syscall1", b_zero, b_zero, b_syscall3 def bye, "bye", b_dolit_s8,__NR_exit, b_zero, b_syscall1 # exit program def zero, "0", b_dolit_s8,0 # push 0 def one, "1", b_dolit_s8,1 # This word outputs a string whose address and count are on # the stack. ( b u -- ) defbytes type, "type" .byte b_rpush, b_rpush # move two args onto rstack # system call is __NR_write: .byte b_dolit_s8,__NR_write .byte b_one # push constant 1: stdout .byte b_rpop, b_rpop # move two args back from rstack .byte b_syscall3 # call syscall with 3 args .byte b_drop # discard return value .byte b_exit # return # The next few words exist just to poke string addresses # and lengths onto the stack so "type" can print them. .macro def_counted_string name, contents defbytes \name, "\name" .byte b_dolit_32 # dolit_32 pushes a 32-bit .int string_\name # literal --- an addr, here # now push literal length and return .byte b_count, b_exit .pushsection .rodata # define the actual string: string_\name: countedstring "\contents" .popsection .endm def_counted_string hello, "hello" def_counted_string world, "world" def_counted_string comma, ", " # convert a counted string in memory to an address and # count on the stack def count, "count", b_dup, b_add1, b_swap, b_c_at def cr, "cr", b_dolit_s8, '\n, b_emit ### Some More Basic Words def neg1, "-1", b_dolit_s8, -1 # ( -- -1 ) def add, "+", b_umplus, b_drop # ( a b -- a+b ) drop the carry def sub1, "1-", b_neg1, b_add # ( n -- n-1 ) def rot, "rot", b_rpush, b_swap, b_rpop, b_swap # ( a b c -- b c a ) def unrot, "-rot", b_rot, b_rot # ( a b c -- c a b ) def tuck, "tuck", b_dup, b_unrot # ( a b -- b a b ) # emit: output a single byte. eForth calls this "TX!". # This version is 11 bytes, including the buffer byte, plus the 2-byte # token table pointer. a machine-code version I wrote the other day # was 28 bytes. However, I also added rot, unrot, and tuck to support # this function, and they total 11 bytes, plus 6 bytes of overhead. # For a total of 11+2+11+6 = 30 bytes. Not winning yet on size over # x86 asm! But we're getting close. emit_buffer: .byte 0 defbytes emit, "emit" .byte b_dolit_32 .int emit_buffer .byte b_tuck # save a copy of address for b_type .byte b_c_bang # store into emit buffer .byte b_one, b_type, b_exit # output one-byte buffer ### "u." prints out an unsigned number. # I had a version of this in x86 machine code in 52 bytes (23 # instructions), essentially exactly the same code as here. # This is 31 bytes, plus 6 bytes of overhead, plus I had # to define b_divmod (9 bytes plus 2 bytes overhead). Now we are # starting to win! defbytes udot, "u." # print space after number .byte b_udot_nospc, b_dolit_s8, 0x20, b_emit, b_exit defbytes udot_nospc, "(u.)" # print number without space .byte b_dup fif 1f .byte b_udot_nonzero, b_exit 1: .byte b_drop, b_dolit_s8, '0, b_emit, b_exit defbytes udot_nonzero, "((u.))" .byte b_zero, b_dolit_s8,10, b_divmod # divide by 10 .byte b_dup fif 2f # recurse if nonzero .byte b_udot_nonzero felse 3f 2: .byte b_drop # drop zero quotient 3: .byte b_dolit_s8, '0, b_add, b_emit # print digit .byte b_exit ### Add signed numeric output, ".". This cost 20 bytes plus 8 bytes # of overhead, but added some fundamental numeric operations; only 12 # of those 28 bytes are specific to "." 
# logical bitwise not
def invert, "~", b_dolit_s8, -1, b_xor
def add1, "1+", b_one, b_add
# arithmetic negation
def negate, "negate", b_invert, b_add1

# print signed number
defbytes dot, "."
        .byte b_dup, b_negative
        fif 1f
        .byte b_dolit_s8, '-, b_emit, b_negate  # in the negative case
1:      .byte b_udot, b_exit

### Obviously the next thing to do is to add ".S", print the
# stack, so that I can stop having to investigate problems by
# using gdb.

# The bytecode for this consumed 78 bytes in 12 words, plus a
# new 8-byte primitive (mmul) and a new 14-byte primitive
# (_loop), plus six bytes in the initialization routine, for 28
# bytes of overhead and a total of 78+28+14+8+6 = 134 bytes.
# This is definitely not a size win over machine code! Machine
# code would only be 22 bytes in 7 instructions, if there were a
# way to just CALL the "." routine from machine code, which
# there isn't.

# However, the words added were cells * - / cellsize nip (do)
# (loop) 2dup which are all generally useful, and
# depth pick .s which are more special-purpose.
# The special-purpose words are 35 out of those 78 bytes.
# PRINTSTACK itself is only 15 bytes, and there's hope that the
# 6 bytes of PICK and the 14 bytes of DEPTH will be useful in
# other debugging routines.

# I'm not happy with (do) and (loop), only because (do)
# implements dpANS DO, not dpANS ?DO, so it loops many times
# when it should loop zero times;

def_counted_string scolon, "s: "
defbytes printstack, ".s"
        .byte b_scolon, b_type
        .byte b_depth, b_zero           # loop limits
        .byte b_twodup, b_xor
        fif 1f                          # skip loop if stack empty
        .byte b__do
2:      .byte b_i, b_pick, b_dot        # DO I PICK . LOOP
        floop 2b
1:      .byte b_cr, b_exit

def pick, "pick", b_add1, b_cells, b_sp_at, b_add, b_at
def cells, "cells", b_cellsize, b_mul
def mul, "*", b_mmul, b_drop    # drop upper 32 bits of multiplication result

bottom_of_stack: .int 0
defbytes depth, "depth"
        .byte b_sp_at, b_dolit_32
        .int bottom_of_stack
        .byte b_at, b_swap, b_sub, b_zero, b_cellsize, b_div, b_exit

def sub, "-", b_negate, b_add           # subtract ( a b -- a-b )
def div, "/", b_divmod, b_nip           # int divide ( ul uh n -- quotient )
def cellsize, "cellsize", b_dolit_s8,4  # ( -- 4 )
def nip, "nip", b_swap, b_drop          # stack manipulation ( a b -- b )

# 10 0 DO ... LOOP loops 0, 1...9.
# _do sets up return stack for _loop
# similar to F83: ( limit initial -- ) ( R: X -- X initial-limit limit )
defbytes _do, "(do)"
        .byte b_over, b_sub, b_swap, b_rpop, b_swap, b_rpush
        .byte b_swap, b_rpush, b_rpush, b_exit

## return loop counter
def i, "i", b_rpop, b_rpop, b_r_at, b_over, b_rpush, b_add, b_swap, b_rpush

def twodup, "2dup", b_over, b_over
def twodrop, "2drop", b_drop, b_drop

# Now some stuff for dealing with the dictionary.
# This stuff was from 804930a to 8049396, 140 bytes. In that we got:
# - new words: dict dictp dictsize nextword pastdict? words < >= 0=
#   cbcmp c@+ bcmp memcmp unloop find r2@ 2swap
# - less concretely:
#   - the ability to list the words in the dictionary;
#   - the ability to find words in the dictionary;
#   - <, >=, and 0= numerical comparisons;
#   - cbcmp, bcmp, and memcmp memory manipulations;
#   - 2swap and r2@ stack manipulations;
#   - unloop loop control;
#   - c@+ for iterating over memory.
# That's 17 new words, averaging 8.2 bytecodes each.
dictionary_pointer: .int end_of_dictionary

defbytes dict, "dict"
        .byte b_dolit_32
        .int dictionary
        .byte b_exit
defbytes dictp, "dictp"
        .byte b_dolit_32
        .int dictionary_pointer
        .byte b_exit
def dictsize, "dictsize", b_dictp, b_at, b_dict, b_sub
def nextword, "nextword", b_dup, b_c_at, b_add, b_add1
def pastdict, "pastdict?", b_dictp, b_at, b_ge

defbytes words, "words"
        .byte b_dict
1:      .byte b_dup, b_count, b_type, b_dolit_s8,32, b_emit
        .byte b_nextword
        .byte b_dup, b_pastdict
        fif 1b
        .byte b_exit

def lt, "<", b_sub, b_negative
def ge, ">=", b_lt, b_zeq

# logical not: return true for 0, false (0) otherwise
defbytes zeq, "0="
        fif 1f
        .byte b_zero, b_exit
1:      .byte b_neg1, b_exit

# To find a word in the dictionary:
# - move the word onto the return stack, and get the dictionary pointer
# - then loop:
#   - see if the word is at the current place
#   - if so, clean up and return that place
#   - otherwise, go to the next word
#   - and repeat if we're still in the dictionary
# - then clean up the stacks and return 0

# Tells whether a counted string equals an address-and-count string.
# 0 for equal, nonzero for unequal.
# ( c-addr1 c-addr2 u -- n )
def cbcmp, "cbcmp", b_rot, b_count, b_twoswap, b_bcmp

# like F21 @A+: ( c-addr -- c-addr+1 char )
def c_at_inc, "c@+", b_dup, b_add1, b_swap, b_c_at

# Keep in mind memcmp() in libc is only 30 bytes long.
# This bcmp is a little different from C memcmp or bcmp in
# that it compares two lengths.
defbytes bcmp, "bcmp"                   # ( c-addr1 u1 c-addr2 u2 -- n )
        .byte b_rot, b_over, b_xor
        fif 3f
        .byte b_twodrop, b_drop, b_one, b_exit
3:      .byte b_memcmp, b_exit

defbytes memcmp, "memcmp"               # ( c-addr1 c-addr2 u -- n )
        .byte b_zero, b__do
2:      .byte b_c_at_inc, b_rot, b_c_at_inc, b_rot
        .byte b_sub, b_dup              # - dup if
        fif 1f
        .byte b_unrot, b_twodrop, b_unloop, b_exit
1:      .byte b_drop, b_swap
        floop 2b
        .byte b_twodrop, b_zero, b_exit

# this should probably go with the other do loop stuff
def unloop, "unloop", b_rpop, b_rpop, b_rpop, b_twodrop, b_rpush

# FIND: ( c-addr u -- token 1 | 0 )
# 30 bytes; jonesforth 42's asm version is 56 bytes. It's
# fairly directly comparable, although jonesforth's FIND has
# to do bit-masking and includes its own inline NEXTWORD and
# CBCMP, which are actually fairly large here.
defbytes find, "find" .byte b_rpush, b_rpush, b_zero, b_dict 1: .byte b_dup, b_r_2at, b_cbcmp, b_zeq # start loop fif 2f # bcmp 0= if .byte b_rpop, b_rpop, b_twodrop, b_drop, b_one, b_exit 2: .byte b_swap, b_add1, b_swap, b_nextword, b_dup, b_pastdict fif 1b .byte b_rpop, b_rpop, b_twodrop, b_twodrop, b_zero, b_exit # copy two cells from return stack defbytes r_2at, "r2@" .byte b_rpop, b_rpop, b_r_at, b_over, b_rpush, b_rot, b_rpush, b_exit # ( a b c d -- c d a b ) def twoswap, "2swap", b_rpush, b_unrot, b_rpop, b_unrot .macro create, name defbytes \name, "\name" .byte b_dolit_32 .int 1f .byte b_exit 1: .endm create "tib" _tibmax = 80 .space _tibmax defbytes tibmax, "tibmax" .byte b_dolit_32 .int _tibmax .byte b_exit create "tibsize" .int 0 defbytes fgets, "fgets" .byte b_tib, b_tibmax, b_read, b_tibsize, b_bang # XXX handle errors .byte b_tib, b_tibsize, b_at, b_exit def gets, "gets", b_zero, b_fgets defbytes read, "read" .byte b_rpush, b_rpush, b_dolit_s8,__NR_read .byte b_zero # fd 0: stdin .byte b_rpop, b_rpop, b_syscall3, b_exit def bl, "bl", b_dolit_s8,32 # space, blank # parse parses out a token of input from a string and leaves # the token's address and length atop the stack # ( c-addr u -- c-addr+n u-n c-addr2 u2 ) defbytes parse, "parse" .byte b_skipwhitespace ## XXX finish him! .byte b_exit defbytes skipwhitespace, "-wsp" 1: .byte b_dup, b_zeq fif 3f # return empty tail .byte b_exit 3: .byte b_sub1, b_swap, b_c_at_inc, b_whitespace fif 2f # escape loop if not whitespace .byte b_swap felse 1b 2: .byte b_sub1, b_swap, b_add1, b_exit ## costs 9+5 bytes def_counted_string wsps, " \n\t" defbytes whitespace, "wsp" .byte b_dup, b_bl, b_xor fif 1f .byte b_dup, b_dolit_s8,'\n, b_xor fif 1f .byte b_dup, b_dolit_s8,'\t, b_xor fif 1f .byte b_drop, b_zero, b_exit 1: .byte b_drop, b_one, b_exit # def repl, "repl", b_gets, b_eval, b_exit # defbytes eval, "eval" # 2: .byte b_parse # .byte b_dup, b_zeq # fif 3f # escape from loop # .byte b_rot, b_rpush, b_rot, b_rpush, b_find # fif 1f # .byte b_execute # 1: .byte b_rpop, b_rpop # felse 2b # 3: .byte b_2drop, b_2drop, b_exit .data 3 instructions: # And here is the actual "main program" in that bytecode. .byte b_sp_at, b_dolit_32 .int bottom_of_stack # variable to remember initial stack bottom .byte b_bang # initialize that variable .byte b_hello # string "hello" and count .byte b_sub1 # subtract 1 from count: "hell" .byte b_type # spit it out .byte b_comma, b_type, b_world, b_type # ", world" .byte b_comma, b_type, b_hello, b_type, b_cr # test the "dot" command to print out numbers .byte b_dolit_s8, -120, b_dot # test positive numbers and "depth" command .byte b_dolit_s8, 104, b_depth, b_dot, b_dot, b_cr # test printstack .byte b_dolit_s8,100, b_dolit_s8,101, b_dolit_s8,102, b_printstack # test dictsize .byte b_dictsize, b_dot, b_cr .byte b_words, b_cr .byte b_hello, b_twodup, b_type, b_find, b_printstack .byte b_comma, b_twodup, b_type, b_find, b_printstack .byte b_dict, b_dot, b_cr .byte b_dolit_s8, '?, b_emit, b_bl, b_emit, b_gets, b_twodup .byte b_zero, b__do 1: .byte b_c_at_inc, b_whitespace, b_dot floop 1b .byte b_drop .byte b_skipwhitespace, b_type .byte b_bye # At end of the assembly program, we initialize the # end_of_dictionary pointer by putting it at the end of the # assembled .dictionary section: .section .dictionary end_of_dictionary:
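
# The 'parse' word above is still marked XXX. For reference, here
# is one possible C reading of its stack comment
# ( c-addr u -- c-addr+n u-n c-addr2 u2 ), splitting off the next
# blank-delimited token the way -wsp and wsp suggest. The names
# are hypothetical and this is not necessarily how the bytecode
# version will end up working.
#
#   #include <stddef.h>
#
#   static int wsp(char c) { return c == ' ' || c == '\n' || c == '\t'; }
#
#   /* skip leading blanks, return the next token (*tok, *toklen),
#      and leave *s/*u pointing just past it */
#   static void parse(const char **s, size_t *u,
#                     const char **tok, size_t *toklen)
#   {
#       while (*u && wsp(**s))  { (*s)++; (*u)--; }   /* like -wsp */
#       *tok = *s;
#       while (*u && !wsp(**s)) { (*s)++; (*u)--; }   /* scan the token */
#       *toklen = (size_t)(*s - *tok);
#   }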