Incidentally, in the small-alphabet case (and realistically anagrams are 
usually about word games over small alphabets) there is a neat trick for 
rapid signature computation: map the 26 letters to the first 26 primes (2, 3, 
5, 7, .., 97, 101) and make the signature of a word the product of its 
letters' primes. If the product never overflows, this is guaranteed to be a 
valid anagram signature by the uniqueness of prime factorization and the 
commutativity of multiplication. Note that this **removes sorting from the 
equation entirely**.
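
As a quick illustration (a minimal sketch, using the naive 'A' .. 'Z' --> 2 .. 101 
mapping; the frequency-tuned program appears later in this post):

    const naivePrime: array[26, uint] = [   # 'A' -> 2, 'B' -> 3, .., 'Z' -> 101
      2'u, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
      43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101 ]

    proc sig(word: string): uint =
      result = 1                            # assumes upper-case A..Z input
      for ch in word: result *= naivePrime[ord(ch) - ord('A')]

    assert sig("LISTEN") == sig("SILENT")   # same letter multiset => same product
    assert sig("CAT") != sig("DOG")         # different letters => different product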

In the worst case, that product can start to overflow a 64-bit integer past 
about 9 letters (since 101^9 =~ 2^60 and 101^10 > 2^66). However, neither worst 
nor average cases really matter, since we have a static dictionary of every 
possible case. So it's actually easy to just test a given dictionary to see if 
any overflows collide in our exact case. While an attacker could easily construct 
colliding inputs, for "organic/natural" dictionaries collisions are likely to be 
very rare, because 2^64 is much more than the square of the number of words in 
most dictionaries. So we are in a regime very far from birthday-paradox 
collisions. Of course, wrapped products aren't truly random, but the "load on 
the address space" is also much less than the square of the number of words: 
only overflowing products can alias anything else, so the relevant count is 
roughly the product of the number of overflowing words and the number of 
non-overflowing words, which is much smaller (assuming most words do not 
overflow). So this is really a somewhat empirical question about words, 
dictionaries, and letters.
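
To put rough numbers on that "load" argument (a back-of-the-envelope sketch using 
the SOWPODS counts reported just below, under the same not-quite-right assumption 
that wrapped products land uniformly at random in the 64-bit space):

    import math
    let
      nOver  = 15_294.0                # words whose product wraps past 2^64
      nClean = 267_751.0 - nOver       # words whose product stays exact
      load   = nOver * nClean          # cross pairs that could even collide
    echo "expected collisions =~ ", load / pow(2.0, 64)   # about 2e-10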

With just a naive 'A' .. 'Z' --> 2 .. 101 mapping and this dictionary: 
[https://github.com/jesstess/Scrabble/blob/master/scrabble/sowpods.txt](https://github.com/jesstess/Scrabble/blob/master/scrabble/sowpods.txt)
 I got 15_294 words that overflowed (and no collisions!) out of 267_751 words. 
One can do better, though. At least in English, letters have a very skewed usage 
distribution, so we can make overflows even less likely. You can just look that 
frequency up on Wikipedia 
([https://en.wikipedia.org/wiki/Letter_frequency](https://en.wikipedia.org/wiki/Letter_frequency))
 and assign the smallest primes to the most frequent letters, as sketched below. 
Doing that with the above SOWPODS dictionary reduces the number of overflows to 223.
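
A sketch of that assignment (the "ETAOIN..." ranking below is only an approximate 
English ordering I am assuming; substitute whatever ranking you look up):

    const
      primesAsc = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
                   43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
      freqOrder = "ETAOINSHRDLCUMWFGYPBVKJXQZ"  # approximate English ranking

    proc byFreq(): array[26, uint] =   # most frequent letter -> smallest prime
      for rank, ch in freqOrder:
        result[ord(ch) - ord('A')] = uint(primesAsc[rank])

    const freqPrime = byFreq()         # e.g. freqPrime[ord('E') - ord('A')] == 2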

You can do better still with a pre-pass to _measure_ this frequency in the 
dictionary and sort the primes list by that measured frequency; I get 175 
overflows that way. Better still is to measure only the frequency of letter 
usage in words _longer than 9 letters_ (the only ones where containing overflow 
actually matters); that gets me down to 136 overflows, over 100x better than the 
initial 15_294. You might do slightly better yet by measuring the frequency of 
letters in just those 136 overflowing words and adjusting, but that is more work 
and it perturbs the prior measurement, perhaps requiring iteration. With 136 
overflowing words, I estimate the a priori probability of any collision at under 
2e-12 (roughly 136 * 267_615 =~ 3.6e7 potentially colliding pairs against a 
~1.8e19 address space, under admittedly not-quite-right randomness assumptions).
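
A sketch of that measuring pre-pass (I actually did the measurement in Python; 
this Nim rendering is just illustrative, and the dictionary file name is an 
assumption):

    import strutils, algorithm

    const primesAsc = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
                       43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

    proc measuredPrimes(dict: string): array[26, uint] =
      var cnt: array[26, int]
      for line in lines(dict):         # count letters, but only in words
        let w = line.toUpperAscii      #..longer than 9 letters, since only
        if w.len > 9:                  #..those can overflow the product
          for ch in w: cnt[ord(ch) - ord('A')].inc
      var order = newSeq[int](26)      # letter indices, most frequent first
      for i in 0 ..< 26: order[i] = i
      order.sort(proc (a, b: int): int = cmp(cnt[b], cnt[a]))
      for rank, letter in order:       # most frequent letter -> smallest prime
        result[letter] = uint(primesAsc[rank])

    echo measuredPrimes("sowpods.txt") # paste the output into `prime` below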

So, while you might worry this "product signature" technique is not applicable 
due to overflow, it probably is applicable in any language where anagrams are 
interesting. Applying all these ideas (except the frequency measurement, which I 
did in a 15-line Python script for its convenient arbitrary-precision arithmetic) 
in Nim, we get: 
    
    
    import strutils,tables,sets #toUpperAscii,Table,HashSet
    
    const prime: array[26, uint] = [  # >.len==9 frequency
      7'u, 59, 29, 31, 2, 67, 47, 53, 5, 101, 73, 23, 43,
      13, 17, 41, 97, 11, 3, 19, 37, 71, 79, 89, 61, 83 ]
    
    proc product(word: string): uint =
      result = 1       #You might worry about overflow, BUT
      for ch in word:  #..long anagrams are so rare it's ok!
        result *= prime[ord(ch) - ord('A')] #223/267751oflow
    
    proc sortByLetter(word: string): string =
      var cnt: array[26, int16]
      for ch in word: cnt[ord(ch) - ord('A')].inc
      for i, c in cnt:
        for n in 0 ..< c: result.add chr(ord('A') + i)
    
    proc pBuildQry(dict="words", query: seq[string]) =
      var anas = initTable[uint, seq[string]]()
      for line in lines(dict):
        let word = line.toUpperAscii
        anas.mgetOrPut(word.product, @[]).add word
      for word in query:
        let word = word.toUpperAscii
        echo word, ":"
        try:
          for ana in anas[word.product]: echo "  ", ana
        except KeyError: discard
    
    proc sBuildQry(dict="words", query: seq[string]) =
      var anas = initTable[string, seq[string]]()
      for line in lines(dict):
        let word = line.toUpperAscii
        anas.mgetOrPut(word.sortByLetter, @[]).add word
      for word in query:
        let word = word.toUpperAscii
        echo word, ":"
        try:
          for ana in anas[word.sortByLetter]: echo "  ", ana
        except KeyError: discard
    
    proc collisions(dict="words") =
      var t = initTable[uint, seq[string]]()
      for line in lines(dict):
        let word = line.toUpperAscii
        t.mgetOrPut(word.product, @[]).add word.sortByLetter
      for product, sigs in t:
        if sigs.len > 1:
          let set = sigs.toHashSet()
          if set.len > 1: echo "collision: ", set
    
    when isMainModule:  #You need to nimble install cligen
      import cligen     #..for this CLI to work.
      dispatchMulti([pBuildQry], [sBuildQry], [collisions])
    
    
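For reference, building and invoking the cligen CLI above looks something like 
this (the file name `anagram.nim` and the dictionary path are just my examples):

    nim c -d:release anagram.nim       # needs: nimble install cligen
    ./anagram pBuildQry --dict=sowpods.txt listen enlist
    ./anagram collisions --dict=sowpods.txt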

With that same SOWPODS dictionary, I get build times about 1.6x faster with the 
product signature than with the string-sorted-by-letters (counting sort) 
signature: 129 ms vs 200 ms (and 230 ms using algorithm.sort instead of the 
counting sort). So, at a round figure, about 2x as fast as @lagerratrobe's 
algorithm. Query times presumably see a similar speed-up.

I don't count the measurement time to decide the order of `prime` in that 129 
ms. Honestly, that ordering is kind of an unneeded optimization; I bet you'd 
have a hard time finding an organic dictionary that defeats this. While only 
zero overflows guarantees zero collisions, you can almost certainly get zero 
collisions even with many overflows. It's easy to check, as shown above, and you 
absolutely should. I was mostly just curious whether I could get down to zero 
overflows.

Of course, these same "measure the data" ideas could also be used in the Unicode 
case, if the "working set" of letters over the used dictionary is smallish (like 
the 26 here) and words are short-ish (like the at most 15 letters here). I 
suspect any language where anagrams interest people has those traits, and this 
would be the fastest approach (at least at query time). You should, of course, 
always measure those collisions as above against your dictionaries. And it would 
still be faster in some system-wide sense to pre-compute/save the signature 
table somewhere, maybe right in the executable itself with `const` and Nim 
compile-time execution, or otherwise in a data file; a sketch of the 
compile-time variant is below. Doing things in a more pre-calculation-separated 
way also makes it easier to check for collisions and/or avoid them with an 
optimal ordering of the `prime` array.
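
A minimal sketch of that compile-time variant, reusing the `prime`/`product` 
definitions from the program above (repeated here so the sketch is 
self-contained) and a `sowpods.txt` path that is just my assumption; for a 
dictionary this big the compile-time build may be quite slow, so this only shows 
the shape of the idea:

    import strutils, tables

    const prime: array[26, uint] = [
      7'u, 59, 29, 31, 2, 67, 47, 53, 5, 101, 73, 23, 43,
      13, 17, 41, 97, 11, 3, 19, 37, 71, 79, 89, 61, 83 ]

    proc product(word: string): uint =
      result = 1
      for ch in word: result *= prime[ord(ch) - ord('A')]

    const anas = block:                # table baked into the executable
      var t = initTable[uint, seq[string]]()
      for line in staticRead("sowpods.txt").splitLines:
        if line.len > 0:
          t.mgetOrPut(line.toUpperAscii.product, @[]).add line.toUpperAscii
      t

    when isMainModule:
      echo anas.getOrDefault("LISTEN".product)   # anagrams of LISTEN, if any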
