One of the things I’ve been wanting to add to [StoneKnifeForth] [] is multi-character tokens. It’s one of those things that distinguishes between “real compilers” and “toy compilers”, and SKF is pretty firmly in the latter category at the moment.
[StoneKnifeForth]: http://github.com/kragen/stoneknifeforth So I wrote a sketch of what a multi-character token facility might look like, first in C, and then in ANS Forth, the idea being that ANS Forth is closer to StoneKnifeForth than C is. Many thanks to Darius Bacon for help with this. Here’s the C version, `tiny-symbol-table.c`: #include <stdio.h> #include <stdlib.h> #include <string.h> /* StoneKnifeForth would really benefit a lot from a real symbol table. But that will probably make it run slower. The simplest possible symbol table would be what Ur-Scheme uses: a linear list. But that might be *too* slow. Nevertheless, here it is. */ enum { entry_capacity = 1024, name_capacity = 4096 }; static char names[name_capacity], *input_ptr = names, *name_ptr = names; void *values[entry_capacity]; void **entry; /* Points to the entry found by find_entry */ /* Returns true if found; entry found is left in “entry”. */ int find_entry(char *s) { char *sp = names; for (entry = values; sp < name_ptr && 0 != strcmp(s, sp); sp += strlen(sp)+1) if (++entry == values + entry_capacity) abort(); return sp < name_ptr; } inline void read_character(char c) { if (names + name_capacity == input_ptr) abort(); /* buffer full. */ *input_ptr++ = c; } inline void discard_name() { input_ptr = name_ptr; } void **name_done() { read_character(0); if (!find_entry(name_ptr)) return 0; discard_name(); /* It was found, so we don’t need to save another copy. */ return entry; } void save_name(void *value) { *entry = value; name_ptr = input_ptr; } /* That's 21 lines of code, 865 characters including the comments. Before that simplification, I had a hash table version there that was 28 lines, 1043 characters and was about 20x faster: it ran about 110-200 CPU instructions per input byte, while this version is closer to 2200 on a small test, and O(N^2) to boot. The external interface (designed for extreme simplicity of implementation rather than ease of use) is: - Call read_character() for each character in the name until you reach the end of the name. - Then you call name_done() to NUL-terminate it, which returns a pointer to the void pointer so you can get and set its value, in which case it deallocates the buffer, or NULL if not found, in which case it doesn’t. - In the not-found case, you either call save_name() (to set a value and switch to a new buffer) or discard_name(). Here’s an example program built on top of that symbol table. It counts the number of occurrences of each “word” in its input. */ int main() __attribute__ ((weak)); int main() { int c; char *n; void **val; do { c = getchar(); if ( ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || ('0' <= c && c <= '9') || (128 <= c && c <= 255) /* UTF-8 */ || '\'' == c || '_' == c) { read_character(c); } else { void **val = name_done(); if (val) { *val = (void*)((int)*val + 1); } else { save_name((void*)1); } } } while (c != EOF); for (n = names, val = values; n < input_ptr; n += strlen(n)+1, val++) { if (strlen(n)) printf("%s: %d\n", n, (int)*val); } return 0; } (End of `tiny-symbol-table.c`.) The Forth version isn’t really any shorter, although I think it's not as transparent as the C version, and it was certainly harder for me to get working. I am not going to claim I know Forth, so this code may be horrible. This is `tiny-symbol-table.fs`: \ ANS FORTH translation of tiny-symbol-table.c. 1024 constant entry_capacity 4096 constant name_capacity create names name_capacity allot create values entry_capacity cells allot variable name_ptr names name_ptr ! variable entry 0 name_ptr @ c! : streq 2dup c@ swap c@ <> if 2drop 0 else count rot count compare 0= then ; variable sp : next-string! dup @ c@ 1+ swap +! ; : next_entry cell entry +! entry @ values entry_capacity cells + = abort" out of entries" sp next-string! ; : find_entry names sp ! values entry ! begin sp @ name_ptr @ u< if dup sp @ streq invert else 0 then while next_entry repeat drop sp @ name_ptr @ u< ; : input_ptr name_ptr @ dup c@ + 1+ ; : discard_name 0 name_ptr @ c! ; : read_character names name_capacity + input_ptr = abort" buffer full" input_ptr c! name_ptr @ c@ 1+ name_ptr @ c! ; : name_done name_ptr @ find_entry if discard_name entry @ else 0 then ; : save_name entry @ ! name_ptr next-string! 0 name_ptr @ c! ; \ The part above takes 15 lines, while the C version takes 21, but \ it's almost the same number of bytes. \ The throw here doesn’t seem to actually work when stdin is closed. Oh well. : ceof -1 ; : getchar pad 1 stdin read-file throw if pad c@ else ceof then ; : islower [char] a [char] z 1+ within ; : isupper [char] A [char] Z 1+ within ; : isdigit [char] 0 [char] 9 1+ within ; : isutf-8 128 256 within ; : isspecial dup [char] ' = swap [char] _ = or ; : wordchar dup islower over isupper or over isdigit or over isutf-8 or swap isspecial or ; \ debugging utilities: : .flag if ." yes " else ." no " then ; : dump-names cr values entry_capacity cells type cr names name_capacity type ; : test-wordchar 256 0 do i 8 mod 0= if cr then i emit [char] : emit i wordchar .flag loop ; \ end debugging utilities : fill-table begin getchar dup dup wordchar if read_character else drop name_done ?dup if 1 swap +! else 1 save_name then then ceof <> while repeat ; variable n variable val : .nospc 0 <<# #s #> type #>> ; : dump-entry n @ count type ." : " val @ @ .nospc cr ; : dump-table names n ! values val ! begin n @ name_ptr @ u< while n @ c@ if dump-entry then n next-string! cell val +! repeat ; fill-table dump-table bye \ That took 15 more lines to reproduce main() from the C version; and \ although it's more horizontal, it's actually *longer* in bytes. (End of `tiny-symbol-table.fs`.) I haven’t benchmarked to see if the Forth version is faster than the C version, due to its more efficient way of stepping past strings; with Bigforth, it easily could be. Like everything else posted to kragen-hacks without a notice to the contrary, this software is in the public domain. -- To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-hacks
