Re: Suggestions for optimization?

Incidentally, in the small-alphabet case (and realistically, anagrams are
usually related to word games over small alphabets) there is a neat trick for
rapid signature computation: map the 26 letters to the first 26 primes (2, 3,
5, 7, .., 97, 101). Then make the signature of a word the product of its
letters' primes. If you never overflow the product, this is guaranteed to be a
valid anagram signature by the uniqueness of prime factorization and the
commutativity of multiplication. Note that this **removes sorting from the
equation entirely**.

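
To make the trick concrete, here is a minimal sketch in Python (chosen because its arbitrary-precision integers sidestep overflow entirely; the example words are mine, purely illustrative):

```python
# Map 'A'..'Z' to the first 26 primes; a word's signature is the
# product of its letters' primes.  By unique prime factorization,
# two words get equal products exactly when they are anagrams.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
          47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def signature(word: str) -> int:
    sig = 1
    for ch in word.upper():
        sig *= PRIMES[ord(ch) - ord('A')]
    return sig  # no sorting anywhere

print(signature("LISTEN") == signature("SILENT"))  # True
print(signature("CAT") == signature("DOG"))        # False
```

Multiplication order does not matter, which is exactly why the sort can be skipped.
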
In the worst case, that product can start to overflow a 64-bit integer past
about 9 letters (since 101^9 =~ 2^60 while 101^10 > 2^66). However, neither
worst nor average cases really matter, since we have a static dictionary of
every possible case. So it's actually easy to just test a given dictionary to
see if any overflows collide in our exact case. While an attacker could defeat
the scheme by constructing collisions deliberately, for "organic/natural"
dictionaries collisions are likely to be very rare, because 2^64 is much more
than the square of the number of words in most dictionaries. So, we are in a
regime very far from birthday-paradox collisions. Strictly, the products aren't
random, but also the "load on the address space" is much less than the square
of the number of words: only overflowing products can collide spuriously, so
the number of dangerous pairs is about the number of overflowing words times
the number of non-overflowing words, which is much smaller (assuming most words
do not overflow). So this is really a somewhat empirical question about words,
dictionaries, and letters.
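
A quick back-of-the-envelope check of those overflow and collision figures (Python, whose big ints make the comparison exact):

```python
# Worst-case letter maps to the largest prime, 101, so k worst-case
# letters give a product of 101**k.
print(101**9 < 2**64)   # True: 9 worst-case letters still fit in 64 bits
print(101**10 > 2**64)  # True: 10 can overflow

# Crude birthday-style bound: even if EVERY word hashed uniformly into
# 2**64 slots, the chance of any collision among N words is roughly
# N*(N-1)/2 / 2**64 -- tiny for dictionary-sized N.
n = 267_751  # the SOWPODS word count used later in this post
print(n * (n - 1) // 2 / 2**64)  # on the order of 1e-9
```

The true bound is smaller still, since only pairs involving an overflowing word can collide spuriously.
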

With just a naive 'A' .. 'Z' --> 2 .. 101 mapping and this dictionary:
[https://github.com/jesstess/Scrabble/blob/master/scrabble/sowpods.txt](https://github.com/jesstess/Scrabble/blob/master/scrabble/sowpods.txt)
I got 15_294 words that overflowed (and no collisions!) out of 267_751 words. One
can do better, though. At least in English, letters have a very skewed usage
distribution. So, we can make overflows even less likely. You can just look
that frequency up on Wikipedia
([https://en.wikipedia.org/wiki/Letter_frequency](https://en.wikipedia.org/wiki/Letter_frequency))
and order the first 26 primes by that frequency, giving the smallest primes to
the most frequent letters. Doing that with the above SOWPODS dictionary reduces
the number of overflows to 223.
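
In code, that remapping might look like the following sketch (Python; the frequency string is one commonly quoted English ranking, not taken from the post, so treat it as illustrative):

```python
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
          47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
# Most frequent letters first: give them the smallest primes so that
# typical products stay far below 2**64.
FREQ_ORDER = "ETAOINSHRDLCUMWFGYPBVKJXQZ"
prime_of = dict(zip(FREQ_ORDER, PRIMES))

def signature(word: str) -> int:
    sig = 1
    for ch in word.upper():
        sig *= prime_of[ch]
    return sig

print(prime_of['E'], prime_of['Z'])  # 2 101
```

The signature property is unchanged; only the magnitude of typical products shrinks.
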

You can do better still with a pre-pass to _measure_ this frequency in the
dictionary and sort the primes list by that measured frequency. I get 175
overflows that way. You can do better still by measuring only the frequency of
letter usage in words _longer than 9 letters_ (where containing overflow
actually matters). I get only 136 overflows that way (over 100x better than the
initial 15,000). You might be able to do slightly better by measuring the
frequency of letters in just the 136 overflow words and adjusting, but that is
more work and upsets the apple cart of the prior measurement, perhaps requiring
iteration. I think that 136 yields an a priori probability of any collision of
under 2e-12 (under certain not-quite-right randomness assumptions).
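
The measurement pre-pass is simple enough to fit in a few lines. A sketch (Python; the author's actual 15-line script is not shown in the post, so this is my reconstruction of the idea, with a toy word list standing in for SOWPODS):

```python
from collections import Counter

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
          47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def primes_by_measured_freq(words, min_len=10):
    # Count letters only in words long enough for overflow to matter
    # (only products over >9 letters can exceed 2**64 in the worst case).
    counts = Counter()
    for w in words:
        w = w.upper()
        if len(w) >= min_len:
            counts.update(c for c in w if c.isalpha())
    # Most frequent letter gets the smallest prime, and so on down.
    order = sorted("ABCDEFGHIJKLMNOPQRSTUVWXYZ", key=lambda c: -counts[c])
    rank = {c: i for i, c in enumerate(order)}
    return [PRIMES[rank[chr(ord('A') + i)]] for i in range(26)]

# Toy stand-in; really you would pass the lines of sowpods.txt here.
table = primes_by_measured_freq(["ABSTEMIOUSLY", "OVERQUALIFIED", "CAT"])
```

The resulting list can then be pasted into the `const prime` array of the Nim program below.
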

So, while you might worry this "product signature" technique is not applicable
due to overflow, it probably is applicable in any language where anagrams are
interesting. Applying all these ideas (except the measurement, which I did in a
15-line Python script for its convenient arbitrary-precision arithmetic) in Nim,
we get:

```nim
import strutils, tables, sets  # toUpperAscii, Table, HashSet

const prime: array[26, uint] = [  # ordered by letter frequency in words longer than 9 letters
  7'u, 59, 29, 31, 2, 67, 47, 53, 5, 101, 73, 23, 43,
  13, 17, 41, 97, 11, 3, 19, 37, 71, 79, 89, 61, 83 ]

proc product(word: string): uint =
  result = 1       # You might worry about overflow, BUT..
  for ch in word:  # ..long anagrams are so rare it's ok!
    result *= prime[ord(ch) - ord('A')]  # 223/267751 overflow

proc sortByLetter(word: string): string =
  var cnt: array[26, int16]  # counting sort; the alphabet is tiny
  for ch in word: cnt[ord(ch) - ord('A')].inc
  for i, c in cnt:
    for n in 0 ..< c: result.add chr(ord('A') + i)

proc pBuildQry(dict="words", query: seq[string]) =
  var anas = initTable[uint, seq[string]]()
  for line in lines(dict):
    let word = line.toUpperAscii
    anas.mgetOrPut(word.product, @[]).add word
  for word in query:
    let word = word.toUpperAscii
    echo word, ":"
    try:
      for ana in anas[word.product]: echo "  ", ana
    except KeyError: discard

proc sBuildQry(dict="words", query: seq[string]) =
  var anas = initTable[string, seq[string]]()
  for line in lines(dict):
    let word = line.toUpperAscii
    anas.mgetOrPut(word.sortByLetter, @[]).add word
  for word in query:
    let word = word.toUpperAscii
    echo word, ":"
    try:
      for ana in anas[word.sortByLetter]: echo "  ", ana
    except KeyError: discard

proc collisions(dict="words") =
  var t = initTable[uint, seq[string]]()
  for line in lines(dict):
    let word = line.toUpperAscii
    t.mgetOrPut(word.product, @[]).add word.sortByLetter
  for product, sigs in t:
    if sigs.len > 1:              # same product..
      let set = sigs.toHashSet()  # ..but distinct letter multisets?
      if set.len > 1: echo "collision: ", set

when isMainModule:  # You need to `nimble install cligen`..
  import cligen     # ..for this CLI to work.
  dispatchMulti([pBuildQry], [sBuildQry], [collisions])
```


With that same SOWPODS dictionary, I get build times about 1.6x faster with the
product signature than with the string-sorted-by-letters-with-counting-sort
signature (129 milliseconds vs 200 milliseconds, and 230 ms for
`algorithm.sort` instead of counting sort). So, at a round figure, about 2x as
fast as @lagerratrobe's algorithm. Query times presumably see a similar
speed-up.

I don't count the measurement time to decide the order of `primes` in that 129
ms. Honestly, that ordering is kind of an unneeded optimization. I bet you'd
have a hard time finding an organic dictionary that defeats this. While zero
overflows guarantees zero collisions, you almost certainly can get zero
collisions even with many overflows. It's easy to check, as shown above, and
you absolutely should. I was mostly just curious whether I could get down to
zero overflows.

Of course, these same "measure the data" ideas could also be applied to the
Unicode case, if the "working set" of letters over the dictionary in use is
smallish, like the 26 here, and words are short-ish, like the at-most-15
letters here. I suspect any language where anagrams are interesting to people
would have those traits, and this would then be the fastest approach (at least
at query time). You should, of course, always measure collisions as above
against your dictionaries. And it would still be faster in some system-wide
sense to pre-compute and save the signature table somewhere, maybe right in the
executable itself with `const` and Nim compile-time execution, or otherwise in
a data file. Doing this in a more pre-calculation-separated way also makes it
easier to check for collisions and/or avoid them with an optimal ordering of
the `primes` array.