simplistic symbol tables in C and Forth

Kragen Javier Sitaker Sat, 19 Jun 2010 00:37:06 -0700

One of the things I’ve been wanting to add to [StoneKnifeForth] [] is
multi-character tokens. It’s one of those things that distinguishes
between “real compilers” and “toy compilers”, and SKF is pretty firmly
in the latter category at the moment.


[StoneKnifeForth]: http://github.com/kragen/stoneknifeforth

So I wrote a sketch of what a multi-character token facility might
look like, first in C, and then in ANS Forth, the idea being that ANS
Forth is closer to StoneKnifeForth than C is.

Many thanks to Darius Bacon for help with this.

Here’s the C version, `tiny-symbol-table.c`:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /*
    StoneKnifeForth would really benefit a lot from a real symbol
    table. But that will probably make it run slower. The simplest
    possible symbol table would be what Ur-Scheme uses: a linear list. But
    that might be *too* slow. Nevertheless, here it is.
    */

    enum { entry_capacity = 1024, name_capacity = 4096 };

    static char names[name_capacity], *input_ptr = names, *name_ptr = names;
    void *values[entry_capacity];
    void **entry;            /* Points to the entry found by find_entry */

    /* Returns true if found; entry found is left in “entry”. */
    int find_entry(char *s) {
      char *sp = names;
      for (entry = values; sp < name_ptr && 0 != strcmp(s, sp); sp += 
strlen(sp)+1)
        if (++entry == values + entry_capacity) abort();
      return sp < name_ptr;
    }

    inline void read_character(char c) { 
      if (names + name_capacity == input_ptr) abort(); /* buffer full. */
      *input_ptr++ = c; 
    }

    inline void discard_name() { input_ptr = name_ptr; }

    void **name_done() {
      read_character(0);
      if (!find_entry(name_ptr)) return 0;
      discard_name(); /* It was found, so we don’t need to save another copy. */
      return entry;
    }

    void save_name(void *value) {
      *entry = value;
      name_ptr = input_ptr;
    }


    /* That's 21 lines of code, 865 characters including the
    comments. Before that simplification, I had a hash table version there
    that was 28 lines, 1043 characters and was about 20x faster: it ran
    about 110-200 CPU instructions per input byte, while this version is
    closer to 2200 on a small test, and O(N^2) to boot.

    The external interface (designed for extreme simplicity of
    implementation rather than ease of use) is:

    - Call read_character() for each character in the name until you
      reach the end of the name.
    - Then you call name_done() to NUL-terminate it, which returns a
      pointer to the void pointer so you can get and set its value, in
      which case it deallocates the buffer, or NULL if not found, in which
      case it doesn’t.
    - In the not-found case, you either call save_name() (to set a
      value and switch to a new buffer) or discard_name().

    Here’s an example program built on top of that symbol table. It counts
    the number of occurrences of each “word” in its input.

    */

    int main() __attribute__ ((weak));

    int main() {
      int c;
      char *n;
      void **val;

      do {
        c = getchar();
        if (   ('a' <= c && c <= 'z')
            || ('A' <= c && c <= 'Z')
            || ('0' <= c && c <= '9')
            || (128 <= c && c <= 255) /* UTF-8 */
            || '\'' == c 
            ||  '_' == c) {
          read_character(c);
        } else {
          void **val = name_done();
          if (val) {
            *val = (void*)((int)*val + 1);
          } else {
            save_name((void*)1);
          }
        }
      } while (c != EOF);

      for (n = names, val = values; n < input_ptr; n += strlen(n)+1, val++) {
        if (strlen(n)) printf("%s: %d\n", n, (int)*val);
      }

      return 0;
    }

(End of `tiny-symbol-table.c`.)

The Forth version isn’t really any shorter, although I think it's not
as transparent as the C version, and it was certainly harder for me to
get working. I am not going to claim I know Forth, so this code may be
horrible. This is `tiny-symbol-table.fs`:

    \ ANS FORTH translation of tiny-symbol-table.c.

    1024 constant entry_capacity  4096 constant name_capacity

    create names name_capacity allot    create values entry_capacity cells allot
    variable name_ptr names name_ptr !  variable entry    0 name_ptr @ c!

    : streq 2dup c@ swap c@ <> if 2drop 0 else count rot count compare 0= then ;

    variable sp  : next-string! dup @ c@ 1+ swap +! ;
    : next_entry cell entry +!  entry @  values entry_capacity cells +
        = abort" out of entries"  sp next-string! ;
    : find_entry  names sp !  values entry !
        begin sp @ name_ptr @ u<  if dup sp @ streq invert else 0 then
        while next_entry repeat   drop  sp @ name_ptr @ u< ;

    : input_ptr name_ptr @ dup c@ + 1+ ;  : discard_name 0 name_ptr @ c! ;
    : read_character  names name_capacity + input_ptr = abort" buffer full" 
        input_ptr c!  name_ptr @ c@ 1+  name_ptr @ c! ;
    : name_done  name_ptr @ find_entry  if discard_name  entry @ else 0 then ;
    : save_name entry @ !  name_ptr next-string! 0 name_ptr @ c! ;

    \ The part above takes 15 lines, while the C version takes 21, but
    \ it's almost the same number of bytes.

    \ The throw here doesn’t seem to actually work when stdin is closed. Oh 
well.
    : ceof -1 ;  : getchar pad 1 stdin read-file throw if pad c@ else ceof then 
;

    : islower [char] a [char] z 1+ within ;  : isupper [char] A [char] Z 1+ 
within ;
    : isdigit [char] 0 [char] 9 1+ within ;  : isutf-8 128 256 within ;
    : isspecial dup [char] ' = swap [char] _ = or ; 
    : wordchar
        dup islower over isupper or over isdigit or over isutf-8 or swap 
isspecial or ;

    \ debugging utilities:
    : .flag if ."  yes " else ."  no  " then ;
    : dump-names cr values entry_capacity cells type  cr names name_capacity 
type ;
    : test-wordchar 256 0 do i 8 mod 0= if cr then
                             i emit [char] : emit
                             i wordchar .flag      loop ;
    \ end debugging utilities

    : fill-table begin getchar dup dup wordchar if read_character else
                         drop name_done ?dup if 1 swap +! else 1 save_name then
                       then ceof <> while repeat ;
    variable n variable val  : .nospc 0 <<# #s #> type #>> ;
    : dump-entry  n @ count type ." : " val @ @ .nospc cr ;
    : dump-table names n ! values val !
        begin n @ name_ptr @ u< while n @ c@ if dump-entry then
                n next-string!  cell val +!                     repeat ;

    fill-table dump-table bye

    \ That took 15 more lines to reproduce main() from the C version; and
    \ although it's more horizontal, it's actually *longer* in bytes.

(End of `tiny-symbol-table.fs`.)

I haven’t benchmarked to see if the Forth version is faster than the C
version, due to its more efficient way of stepping past strings; with
Bigforth, it easily could be.

Like everything else posted to kragen-hacks without a notice to the
contrary, this software is in the public domain.

-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-hacks

simplistic symbol tables in C and Forth

Reply via email to