The hashtable currently has correctness and efficiency problems resulting from string encodings (and their interaction with garbage collection).
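To preview the correctness problem described below with something concrete, here is a small, self-contained example. The djb2-style hash and the byte layouts are illustrative assumptions standing in for whatever the real code does, not the actual string internals:

    /* Illustrative stand-in for hashing straight over the raw data:
     * "A" stored as a single byte and "A" stored as a little-endian
     * UTF-32 code unit are the same string, but hash differently. */
    #include <stdio.h>
    #include <stddef.h>

    static unsigned long
    hash_bytes(const unsigned char *buf, size_t len)
    {
        unsigned long h = 5381;
        while (len--)
            h = h * 33 + *buf++;
        return h;
    }

    int
    main(void)
    {
        const unsigned char singlebyte[] = { 0x41 };                   /* "A" */
        const unsigned char utf32le[]    = { 0x41, 0x00, 0x00, 0x00 }; /* "A" */

        printf("singlebyte: %lu\nutf32:      %lu\n",
               hash_bytes(singlebyte, sizeof singlebyte),
               hash_bytes(utf32le,    sizeof utf32le));
        return 0;
    }

Both buffers spell the same one-character string, but hashing the raw bytes gives two different values, so the two keys will usually land in different buckets.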
Correctness example: Currently, string hash values are computed by going straight to the raw string data and running a fairly typical hashing algorithm over it. As a result, identical strings with different encodings end up with different hash values. Not only that, but when walking a chain of buckets to look up a key, encoding *is* taken into account, so if you happen to map to the same bucket as an equivalent string with a different encoding, the lookup will return it. So depending on the encoding, an equal key sometimes gets found and sometimes doesn't. A clear bug, fixable by canonicalizing the encoding before computing the hash value.

Performance example: When resizing a hash, all existing buckets are swept through and each one's hash value is checked to see whether it needs to be moved. Every comparison potentially requires a transcoding. Not only that, but transcoding can trigger a collection, so you always have to go through the Buffer header structure to find each bucket, even when walking along a chain. Assuming the overwhelmingly common case is no transcoding (and hence no collection), this slows hashes down by interfering with compiler and hardware optimizations.

Memory example: Right now, we seem to use STRINGs with a 'singlebyte' encoding for raw binary data (opaque bitstrings with no assumed semantics). If I understand correctly, the 'singlebyte' encoding is unusable as a canonical encoding because it cannot hold the full semantics of, e.g., a UTF-32 string. So if your hash keys are large binary wads, they get exploded by a factor of 4 when canonicalized to the most likely candidate canonical encoding, UTF-32.

Is my understanding of the situation correct? If so, what's the best way to get correct, fast hashes? And as a tertiary concern, can they be small too?

Brainstormed possibilities for fixing the correctness bugs and changing the time/space utilization:

- The obvious solution: always use UTF-32 as the encoding for hash keys. (sometimes faster, sometimes slower, uses more space)

- For hashing and/or comparison, go through the string_ord interface to pull out each codepoint in turn. (slow, uses the least space)

- Associate a canonical encoding with each hashtable. If it is not given on creation, initialize it to the encoding of the first string stored in the hash. When a string with a different encoding is stored in the hash, transcode it. If it cannot be transcoded (because the hash's canonical encoding is incapable of storing the new key), widen the hash's canonical encoding and transcode all previously stored keys. (faster for more things, uses intermediate space)

- Add a string_hash entry to the encoding vtable. This is probably a good idea in any case; hashes really have no business looking at their keys' private parts. (faster for some things, no effect on space, more overhead for adding encodings, more likely to stay correct with new encodings) A sketch of this appears at the end of this message.

- As above, but assert that the string_hash vtable entry is guaranteed not to trigger a collection. (Preferably by not allocating any memory in the first place!) I think this really only helps the resizing case, since normally there are more string comparisons than hash computations anyway. (faster lookups, fast resizes, complicates encoding implementation)

- Provide a string_compare that is guaranteed not to trigger a collection. For example, implement it by providing iterator entries in the encoding vtable and iterating through the codepoints of the two strings in parallel. But that could be pretty slow too. And you have to put the iterator structures somewhere without triggering a collection! (unknown effect on speed, uses the least space) This is also sketched at the end of this message.

A last comment - with GC, speed and space are more interdependent than usual: the more space you use, the more frequently you need to collect.
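To make the string_hash-in-the-encoding-vtable option concrete, here is a minimal, self-contained sketch. Every name in it (Encoding, Str, enc_singlebyte, enc_utf32, the djb2-style mixing) is a hypothetical stand-in rather than the real string or encoding structures; the point is only that when each encoding hashes its own codepoints, equal strings hash equally regardless of encoding, and nothing needs to allocate:

    /* Hypothetical sketch: a per-encoding string_hash entry that hashes
     * codepoints rather than raw bytes, so the hash never has to look
     * at its keys' private parts. */
    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct Str Str;

    typedef struct Encoding {
        unsigned long (*string_hash)(const Str *s);
    } Encoding;

    struct Str {
        const Encoding *enc;
        const void     *data;
        size_t          bytes;
    };

    static unsigned long
    mix(unsigned long h, uint32_t cp)
    {
        return h * 33 + cp;             /* hash codepoints, not raw bytes */
    }

    static unsigned long
    singlebyte_hash(const Str *s)
    {
        const uint8_t *p = s->data;
        unsigned long  h = 5381;
        for (size_t i = 0; i < s->bytes; i++)
            h = mix(h, p[i]);
        return h;
    }

    static unsigned long
    utf32_hash(const Str *s)
    {
        const uint32_t *p = s->data;
        unsigned long   h = 5381;
        for (size_t i = 0; i < s->bytes / 4; i++)
            h = mix(h, p[i]);
        return h;
    }

    static const Encoding enc_singlebyte = { singlebyte_hash };
    static const Encoding enc_utf32      = { utf32_hash };

    int
    main(void)
    {
        static const uint8_t  sb[]  = { 'h', 'i' };
        static const uint32_t u32[] = { 'h', 'i' };
        Str a = { &enc_singlebyte, sb,  sizeof sb  };
        Str b = { &enc_utf32,      u32, sizeof u32 };

        /* same codepoints, different encodings, identical hash values */
        printf("%lu == %lu\n",
               a.enc->string_hash(&a), b.enc->string_hash(&b));
        return 0;
    }

With both encodings hashing codepoints instead of raw bytes, the mismatch shown in the correctness example goes away without any transcoding.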
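And a similarly hypothetical sketch of the collection-free string_compare idea: the encoding vtable supplies a next-codepoint entry, the comparison walks both strings in parallel, and the iterator state lives on the C stack, so nothing is allocated and no collection can be triggered. Again, all names and layouts here are invented for illustration:

    /* Hypothetical sketch: codepoint iterators in the encoding vtable,
     * with the iterator structures kept on the C stack so the
     * comparison cannot allocate or trigger a collection. */
    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct Str Str;

    typedef struct Iter {
        const Str *s;
        size_t     pos;          /* byte offset into the string data */
    } Iter;

    typedef struct Encoding {
        /* returns the next codepoint, or -1 when the string is exhausted */
        int32_t (*next)(Iter *it);
    } Encoding;

    struct Str {
        const Encoding *enc;
        const void     *data;
        size_t          bytes;
    };

    static int32_t
    singlebyte_next(Iter *it)
    {
        const uint8_t *p = it->s->data;
        if (it->pos >= it->s->bytes)
            return -1;
        return p[it->pos++];
    }

    static int32_t
    utf32_next(Iter *it)
    {
        const uint32_t *p = it->s->data;
        if (it->pos >= it->s->bytes)
            return -1;
        int32_t cp = (int32_t)p[it->pos / 4];
        it->pos += 4;
        return cp;
    }

    static const Encoding enc_singlebyte = { singlebyte_next };
    static const Encoding enc_utf32      = { utf32_next };

    /* <0, 0, >0 like strcmp; allocation-free by construction */
    static int
    string_compare(const Str *a, const Str *b)
    {
        Iter ia = { a, 0 }, ib = { b, 0 };   /* iterators live on the stack */
        for (;;) {
            int32_t ca = a->enc->next(&ia);
            int32_t cb = b->enc->next(&ib);
            if (ca != cb)
                return ca < cb ? -1 : 1;
            if (ca == -1)
                return 0;
        }
    }

    int
    main(void)
    {
        static const uint8_t  sb[]  = { 'h', 'i' };
        static const uint32_t u32[] = { 'h', 'i' };
        Str a = { &enc_singlebyte, sb,  sizeof sb  };
        Str b = { &enc_utf32,      u32, sizeof u32 };

        printf("compare = %d\n", string_compare(&a, &b));   /* prints 0 */
        return 0;
    }

Keeping the iterators on the stack answers the "where do the iterator structures live" question; the per-codepoint indirect call is where the potential slowness comes from.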