Case Nelson wrote:
> Hi there, I've just been playing around with some Python code and I've
> got a fun little optimization problem I could use some help with.
>
> Basically, the program needs to take in a random list of no more than
> 10 letters, and find all possible mutations that match a word in my
> dictionary (80k words). However, a wildcard letter '?' is also an
> acceptable character, which increases the worst-case time significantly.
> So if the letters are ['a','b','c'], check a, b, c, ab, ac, ba, bc, ca,
> cb, abc, acb, bac, bca, cab, cba, where only a, ba and cab would be
> added to the dict of words. If the letters are ['?','?'], check a-z, aa,
> ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz.
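[The brute-force search the question describes can be sketched as follows. This is a minimal illustration, not the poster's code; the three-word set stands in for the real 80k-word dictionary.]

```python
# Brute force: generate every permutation of every non-empty subset of
# the letters, expanding each '?' wildcard to a-z, then keep only the
# strings that are dictionary words.
from itertools import permutations
from string import ascii_lowercase

def expand(perm):
    """Expand wildcards: each '?' in the tuple becomes every letter a-z."""
    if '?' not in perm:
        yield ''.join(perm)
        return
    i = perm.index('?')
    for c in ascii_lowercase:
        yield from expand(perm[:i] + (c,) + perm[i + 1:])

def candidates(letters):
    """Yield every string formed by permuting any non-empty subset."""
    for n in range(1, len(letters) + 1):
        for perm in permutations(letters, n):
            yield from expand(perm)

words = {'a', 'ba', 'cab'}   # stand-in for the 80k-word dictionary
found = sorted(set(candidates(['a', 'b', 'c'])) & words)
print(found)  # ['a', 'ba', 'cab']
```

With wildcards the candidate count explodes (each '?' multiplies the work by 26), which is exactly the worst case the question is worried about.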
This appears to be a Computer Science 101 Data Structures and Algorithms question, not a Python question, but here's an answer anyway: you appear to want to find all words that have one or more letters in common with your query or candidate string.

Aside: have you been following the long thread started by the poster who appeared to want to store all possible strings that were _not_ words in a given language but could be generated from its alphabet?

Here's a possibility: use a bit-mask approach. You attach a bit mask to each word; simple data structure -- a list of 2-tuples, or two parallel lists.

!def mask(word):
!    m = 0
!    for letter in word:
!        m |= 1 << (ord(letter) - ord('a'))
!    return m

Searching without wildcards:

!def nowc_search(candidate, mylistof2tuples):
!    candmask = mask(candidate)  # treating candidate as str, not list
!    # loop variable renamed so it doesn't shadow the mask() function
!    for word, wordmask in mylistof2tuples:
!        if wordmask & candmask:
!            # one or more letters in common
!            yield word

Note: this treats "mississippi" and "misp" the same. If "aa" is in your dictionary, what queries would retrieve it? Depending on your exact requirements, this technique may suit you, or you may want to use it as a fast(?) filter, with the matches it throws up needing further checking. You may need a "count number of bits that are set in an int" function.

Ref: Fred J. Damerau, "A technique for computer detection and correction of spelling errors", CACM vol 7 number 3, March 1964.

Searching with wild cards: your example of query == "??" seems to yield all two-letter words. I'd like to see what you expect for "a?", "?a", "ab?", and "aa?" before suggesting how to tackle wild cards. Reverse-engineering requirements out of other folks' code is not something I do for fun :-)

An alternative for you to think about: instead of a bit mask, store the letter-sorted transformation of each word: cat->act, act->act, dog->dgo, god->dgo.

Alternative data structure: key = bitmask or sorted-letters, value = list of all words that have that key.
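[The sorted-letters alternative above can be sketched as follows. A minimal illustration, not the poster's code: the five-word list stands in for the real dictionary, and the `popcount` helper is one common way to write the "count bits set" function mentioned earlier.]

```python
# key = letter-sorted form, value = list of all words with that key,
# so every anagram group shares a single dictionary entry.
from collections import defaultdict

def sortkey(word):
    """Letter-sorted transformation: cat->act, dog->dgo, god->dgo."""
    return ''.join(sorted(word))

anagrams = defaultdict(list)
for w in ['cat', 'act', 'dog', 'god', 'tac']:
    anagrams[sortkey(w)].append(w)

print(anagrams['act'])   # ['cat', 'act', 'tac']
print(anagrams['dgo'])   # ['dog', 'god']

# One way to count the bits set in a mask
# (Python 3.10+ also has int.bit_count()):
def popcount(m):
    return bin(m).count('1')

print(popcount(0b10110))  # 3
```

A query is then answered by sorting the query's letters and doing a single dict lookup, instead of scanning the whole word list.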
A further suggestion, which should always be considered when setting up a search whose timing is worse than average-case O(1): have a separate dictionary for each different word length, or some other impossible-length-avoidance filter; that way, with minimal preliminary calculation, you can avoid considering words that are so long or so short that they cannot possibly be matches. For example, with approximate matching based on edit distance, if you are searching for a 10-letter word allowing for 2 errors, you can avoid doing the complicated comparison on words shorter than 8 or longer than 12.

HTH,
John
--
http://mail.python.org/mailman/listinfo/python-list
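[The per-length-dictionary filter above can be sketched as follows; a minimal illustration with a stand-in word list, not code from the thread.]

```python
# Bucket the word list by length, so an edit-distance search for a
# pattern of length L with at most k errors only touches buckets
# L-k .. L+k and skips impossible lengths entirely.
from collections import defaultdict

words = ['cat', 'cart', 'carts', 'dog', 'chart', 'a']
by_length = defaultdict(list)
for w in words:
    by_length[len(w)].append(w)

def candidates_by_length(pattern, max_errors):
    """Yield only words whose length could possibly match."""
    for n in range(len(pattern) - max_errors, len(pattern) + max_errors + 1):
        yield from by_length.get(n, [])

print(list(candidates_by_length('cart', 1)))
# ['cat', 'dog', 'cart', 'carts', 'chart']
```

Only the survivors of this cheap filter need the expensive edit-distance comparison.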