Incidentally, in the small-alphabet case (and realistically, anagrams are usually related to word games over small alphabets) there is a neat trick for rapid signature computation: map the 26 letters to the first 26 primes (2, 3, 5, 7, ..., 97, 101). Then make the signature of a word the product of its letters' primes. As long as the product never overflows, this is guaranteed to be a valid anagram signature by the uniqueness of prime factorization and the commutativity of multiplication. Note that this **removes sorting from the equation entirely**.
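The core trick can be sketched in a few lines of Python, whose arbitrary-precision integers sidestep overflow entirely (the example words are just illustrative):

```python
# Prime-product anagram signatures for A-Z words.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
          59, 61, 67, 71, 73, 79, 83, 89, 97, 101]  # first 26 primes

def signature(word: str) -> int:
    """Product of one prime per letter; order-independent by commutativity."""
    sig = 1
    for ch in word.upper():
        sig *= PRIMES[ord(ch) - ord('A')]
    return sig

# Two words are anagrams iff their signatures match -- no sorting needed.
print(signature("LISTEN") == signature("SILENT"))  # True
print(signature("LISTEN") == signature("TINSEL"))  # True
print(signature("LISTEN") == signature("LISTED"))  # False
```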
In the worst case, that product can overflow a 64-bit integer past about 9 letters (since 101^9 ≈ 2^60 while 101^10 > 2^66). However, neither worst nor average cases really matter, since we have a static dictionary of every possible case. So it's actually easy to just test a given dictionary to see if any overflows collide in our exact case. While this is easy for an attacker to defeat, for "organic/natural" dictionaries collisions are likely to be very rare, because 2^64 is much more than the square of the number of words in most dictionaries. So we are in a regime very far from birthday-paradox collisions. Really, the signatures aren't random, but also the "load on the address space" is much less than the square of the number of words: it is the product of the number of non-overflowing words and overflowing words, which is much smaller (assuming most words do not overflow). So this is really a somewhat empirical question about words, dictionaries, and letters.

With just a naive 'A' .. 'Z' --> 2 .. 101 mapping and this dictionary: [https://github.com/jesstess/Scrabble/blob/master/scrabble/sowpods.txt](https://github.com/jesstess/Scrabble/blob/master/scrabble/sowpods.txt), I got 15,294 words that overflowed (and no collisions!) out of 267,751 words.

One can do better, though. At least in English, letter usage has a very skewed distribution, so we can make overflows even less likely. You can just look that frequency up on Wikipedia ([https://en.wikipedia.org/wiki/Letter_frequency](https://en.wikipedia.org/wiki/Letter_frequency)) and sort the first 26 primes by that frequency, assigning the smallest primes to the most frequent letters. Doing that with the above SOWPODS dictionary reduces the number of overflows to 223. You can do better still with a pre-pass to _measure_ this frequency in the dictionary itself and sort the primes by that measured frequency. I get 175 overflows that way.
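A minimal Python sketch of this "measure the data" idea, using a tiny inline word list rather than SOWPODS (function names and the sample words are my own, purely illustrative):

```python
from collections import Counter

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
          59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def freq_ordered_primes(words):
    """Assign the smallest primes to the most frequent letters,
    so typical products stay small and overflow less often."""
    counts = Counter(ch for w in words for ch in w.upper())
    order = sorted("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                   key=lambda c: -counts[c])   # most frequent first
    return {letter: p for letter, p in zip(order, PRIMES)}

def count_overflows(words, prime_of):
    """How many products exceed 64 bits under this letter->prime map?"""
    n = 0
    for w in words:
        sig = 1
        for ch in w.upper():
            sig *= prime_of[ch]
        if sig >= 1 << 64:
            n += 1
    return n

words = ["listen", "silent", "enlist", "characteristically"]
naive = {c: p for c, p in zip("ABCDEFGHIJKLMNOPQRSTUVWXYZ", PRIMES)}
print(count_overflows(words, naive))                      # 1 (the 18-letter word)
print(count_overflows(words, freq_ordered_primes(words))) # 0
```

Even on this toy list, the frequency-ordered mapping keeps the 18-letter word's product under 2^64 while the naive mapping overflows.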
You can do better still by measuring only the frequency of letter usage in words _longer than 9 letters_ (where it actually matters for containing overflow). I get only 136 overflows (over 100x better than the initial 15,294) that way. You might be able to do slightly better by measuring the frequency of letters in just those 136 overflowing words and adjusting, but that is more work and upsets the apple cart of the prior measurement, perhaps requiring iteration. I think 136 yields an a priori probability of any collision under 2e-12 (under certain not-quite-right randomness assumptions). So, while you might worry this "product signature" technique is not applicable due to overflow, it probably is applicable in any language where anagrams are interesting.

Applying all these ideas (except the measurement, which I did in a 15-line Python script for its convenient arbitrary-precision arithmetic) in Nim, we get:

```nim
import strutils, tables, sets       # toUpperAscii, Table, HashSet

const prime: array[26, uint] = [    # sorted by >.len==9 frequency
  7'u, 59, 29, 31, 2, 67, 47, 53, 5, 101, 73, 23, 43,
  13, 17, 41, 97, 11, 3, 19, 37, 71, 79, 89, 61, 83 ]

proc product(word: string): uint =
  result = 1                        # You might worry about overflow, BUT
  for ch in word:                   # ..long anagrams are so rare it's ok!
    result *= prime[ord(ch) - ord('A')]   # 223/267751 oflow

proc sortByLetter(word: string): string =
  var cnt: array[26, int16]
  for ch in word: cnt[ord(ch) - ord('A')].inc
  for i, c in cnt:
    for n in 0 ..< c: result.add chr(ord('A') + i)

proc pBuildQry(dict="words", query: seq[string]) =
  var anas = initTable[uint, seq[string]]()
  for line in lines(dict):
    let word = line.toUpperAscii
    anas.mgetOrPut(word.product, @[]).add word
  for word in query:
    let word = word.toUpperAscii
    echo word, ":"
    try:
      for ana in anas[word.product]: echo "  ", ana
    except: discard

proc sBuildQry(dict="words", query: seq[string]) =
  var anas = initTable[string, seq[string]]()
  for line in lines(dict):
    let word = line.toUpperAscii
    anas.mgetOrPut(word.sortByLetter, @[]).add word
  for word in query:
    let word = word.toUpperAscii
    echo word, ":"
    try:
      for ana in anas[word.sortByLetter]: echo "  ", ana
    except: discard

proc collisions(dict="words") =
  var t = initTable[uint, seq[string]]()
  for line in lines(dict):
    let word = line.toUpperAscii
    t.mgetOrPut(word.product, @[]).add word.sortByLetter
  for product, sigs in t:
    if sigs.len > 1:
      let set = sigs.toHashSet()
      if set.len > 1: echo "collision: ", set

when isMainModule:
  import cligen   # You need to `nimble install cligen` for this CLI to work.
  dispatchMulti([pBuildQry], [sBuildQry], [collisions])
```

With that same SOWPODS dictionary, I get build times about 1.6x faster with the product signature than with the string-sorted-by-letters signature using counting sort (129 ms vs. 200 ms, and 230 ms with `algorithm.sort` instead of counting sort). So, at a round figure, about 2x as fast as @lagerratrobe's algorithm. Query times presumably see a similar speed-up. I don't count the measurement time to decide the order of `prime` in that 129 ms. Honestly, that is kind of an unneeded optimization. I bet you'd have a hard time finding an organic dictionary defeating this.
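As a sanity check on that under-2e-12 figure, here is the rough birthday-style arithmetic in Python, under the same not-quite-right assumption that each overflowed product lands uniformly at random in the 64-bit space:

```python
# Back-of-the-envelope bound on any collision with 136 overflowing words.
overflowing = 136           # words whose product wraps past 2^64
total_words = 267_751       # SOWPODS size quoted above
non_overflowing = total_words - overflowing

# Each wrapped signature gets ~non_overflowing chances to hit a legitimate
# signature, plus overflowing-choose-2 chances to hit another wrapped one;
# divide the total pair count by the 2^64 address space.
pairs = overflowing * non_overflowing + overflowing * (overflowing - 1) / 2
p_collision = pairs / 2.0**64
print(f"{p_collision:.2e}")  # 1.97e-12 -- under the 2e-12 quoted above
```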
While only zero overflows *guarantees* zero collisions, you can almost certainly get zero collisions even with many overflows. It's easy to check, as shown above, and you absolutely should. I was mostly just curious whether I could get down to zero overflows. Of course, these same "measure the data" ideas could also be used in the Unicode case, if the "working set" of letters over the dictionary in use is smallish, like the 26 here, and words are short-ish, like the at-most-15 letters here. I suspect any language where anagrams are interesting to people would have those traits, and there this would be the fastest approach (at least at query time). You should, of course, always measure those collisions, as above, against your dictionaries. And it would still be faster in some system-wide sense to pre-compute and save the signature table somewhere, maybe right in the executable itself with `const` and Nim compile-time execution, or otherwise in a data file. Doing this in a more pre-calculation-separated way also makes it easier to check for collisions and/or avoid them with an optimal ordering of the `prime` array.
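As one illustration of the pre-compute-and-save idea (in Python rather than Nim compile-time execution; the file name, pickle format, and tiny word list are all just my choices for the sketch), build the table once, serialize it, and let the query side never touch the dictionary:

```python
import os, pickle, tempfile
from collections import defaultdict

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
          59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def signature(word):
    sig = 1
    for ch in word.upper():
        sig *= PRIMES[ord(ch) - ord('A')]
    return sig % (1 << 64)   # emulate 64-bit wraparound, as in the Nim code

def build_table(words):
    table = defaultdict(list)
    for w in words:
        table[signature(w)].append(w.upper())
    return dict(table)

# Pre-computation pass: build once and save to a data file.
words = ["listen", "silent", "enlist", "tinsel", "google"]
path = os.path.join(tempfile.mkdtemp(), "sigs.pickle")
with open(path, "wb") as f:
    pickle.dump(build_table(words), f)

# Query pass: reload the saved table; no dictionary scan needed.
with open(path, "rb") as f:
    table = pickle.load(f)
print(table[signature("SILENT")])  # ['LISTEN', 'SILENT', 'ENLIST', 'TINSEL']
```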