-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I've noticed some sluggishness in m4's hand-written symbol hash table when the user has lots of symbols, particularly after some recent work on autoconf experimenting with O(n) unique set entry (compared to the current O(n^2) m4_append_uniq) by creating a new witness macro for each unique entry. In m4 1.4.x, the symbol table is hardcoded to 509 buckets unless you use -H, but autom4te does not use it. Therefore, with coreutils' configure.ac having more than 450 AC_SUBST, along with all the other m4sugar macros, the pigeonhole principle says that running autoconf on coreutils is suffering from quite a few hash collisions, such that hash lookup is no longer O(1).
I plan on adding automatic table resizing in m4. I could borrow from the master branch, which was patched several years ago to always using 2^n-1 as the new bucket count. Or I could use a gnulib module - both linkedhash-list and hash appear to fit the bill. But there are some differences: Does it really matter whether the set size is prime vs. 2^n-1 in how likely a modulo operation in the hash is to cause collisions? Both gnulib modules always use a prime (linkedhash-list via a hard-coded table of primes in gl_anyhash_list2.h, hash via a runtime check next_prime in hash.c). Of the options, I don't like the speed penalty of a runtime computation. It seems like linkedhash-list has some other speed improvements over hash. ~ In hash, the hashing callback is given the number of buckets, and must perform the modulo itself (not to mention that the sample function hash_string, when !USE_DIFF_HASH, computes a division on every byte of the string, rather than once at the end). Since the hash depends on the number of buckets, the hash callback must be called every time the table is resized, and it does not make sense to cache the hash value, so every comparison of table elements must invoke the comparison callback. In linkedhash-list, the hash function needs only compute a size_t value, which is then cached alongside the entry; that way, resizing the table does not need to call the hash callback, and when doing element comparisons, the comparison callback need only be called in the unlikely case that the cached hash values do not match. Is the memory cost of saving the size_t hash worth this speedup, or are collisions and resizing generally rare enough that the number of extra callbacks is in the noise compared to the memory savings? Another difference: linkedlist-hash is hard-coded on its growth parameters, while hash allows the user to specify both fullness threshold and growth factors (and even shrink factors, as elements are removed). I'm not sure if m4 needs the flexibility of dynamic growth factors, so I could go either way. The hash module has a few more convenience wrappers, such as hash_do_for_each or hash_get_entries. On the other hand, linkedlist-hash allows the notion of a sorted hash table, where the elements can be retrieved in sorted order rather than having to call qsort at the end, at the expense of a slower insertion. m4 is already using linkedhash-list (indirectly, via its use of avltree-oset), so at this point, I'm leaning towards ditching m4's hand-rolled implementation and using linkedhash-list for minimal code size. Any other comments on recommended hash implementations or how either gnulib module could borrow ideas from each other? - -- Don't work too hard, make some time for fun as well! Eric Blake [EMAIL PROTECTED] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Cygwin) Comment: Public key at home.comcast.net/~ericblake/eblake.gpg Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkiGm/4ACgkQ84KuGfSFAYCtHwCcCyzDnN53lHbrU194+cpSJ5QV vjsAn2vBvdmuRyALmdiVuH0QcDjCcLNK =MXFM -----END PGP SIGNATURE----- _______________________________________________ m4-discuss mailing list m4-discuss@gnu.org http://lists.gnu.org/mailman/listinfo/m4-discuss